
MACHINE LEARNING

ASSIGNMENT-1
NAME : RAMAN MAURYA
REGNO : 21DBCAD023
Decision Tree Classifier
 A decision tree is a non-parametric supervised learning algorithm, which is utilized
for both classification and regression tasks. It has a hierarchical, tree structure,
which consists of a root node, branches, internal nodes and leaf nodes.

 Its primary purpose is to create a model that predicts the target variable (a
categorical label in classification problems) based on a set of input features.

 It does this by recursively partitioning the input data into subsets, making decisions
at each step, ultimately forming a tree-like structure where the leaves represent the
class labels or predicted values.
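How a tree chooses a partition can be sketched with an impurity measure. The example below assumes Gini impurity, which is scikit-learn's default criterion; the slides do not say which criterion is meant, and the helper names gini and split_impurity are hypothetical. A split that separates the classes cleanly gets a lower weighted impurity than one that leaves them mixed.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Toy node with two classes, three samples each (illustrative data only)
parent = ["a", "a", "a", "b", "b", "b"]
pure_split = split_impurity(["a", "a", "a"], ["b", "b", "b"])
mixed_split = split_impurity(["a", "b", "a"], ["b", "a", "b"])

print("parent impurity:", gini(parent))
print("pure split:", pure_split)
print("mixed split:", mixed_split)
```

A tree builder would evaluate many candidate splits this way and keep the one with the lowest weighted impurity, then repeat on each child node.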
Purpose of Decision Tree Classifier

 Classification : Decision trees are commonly used for classification tasks, where
the goal is to predict a categorical target variable. This could include tasks like spam
detection, disease diagnosis, sentiment analysis, or customer churn prediction.

 Interpretability: Decision trees are highly interpretable models, making them
valuable in situations where understanding why a particular prediction was made is
essential. They can be visualized and easily understood by humans.

 Feature Importance: Decision trees can implicitly rank the importance of input
features, allowing you to identify which features are most relevant for making
predictions.
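The feature ranking can be read from a fitted scikit-learn tree through its feature_importances_ attribute. This is a minimal sketch on the iris data; the depth limit and random_state are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0, max_depth=3).fit(iris.data, iris.target)

# Pair each feature name with its impurity-based importance score
importances = dict(zip(iris.feature_names, clf.feature_importances_))

# Print the features from most to least important
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The scores sum to 1, and a feature the tree never splits on gets a score of 0.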
Export_text
 The export_text function in scikit-learn is used to generate a textual representation
of a decision tree model, which can be useful for understanding how the tree makes
predictions.

 This textual representation provides information about the decision rules at each
node in the tree, including feature names, threshold values, and class predictions.

 It's a valuable tool for model interpretation and debugging.


Purpose of Export_text

 Model Interpretation: export_text allows you to generate a human-readable
representation of a decision tree model. This can help you understand how the model is
making decisions based on input features, making it easier to interpret and explain the
model's behavior to stakeholders or domain experts.

 Debugging: When working with decision tree models, especially deep or complex
ones, export_text can be used for debugging. You can visually inspect the tree structure
and decision rules to identify potential issues or sources of misclassification.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
# Load the iris data; the 150 samples are ordered by species, 50 of each
iris = load_iris()
X = iris['data']
y = ['setosa']*50+['versicolor']*50+['virginica']*50
# Fit a shallow tree so the printed rules stay readable
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=3)
decision_tree = decision_tree.fit(X, y)
# export_text is imported from sklearn.tree (the old sklearn.tree.export path no longer exists)
r = export_text(decision_tree, feature_names=iris['feature_names'], decimals=0, show_weights=True)
print(r)

Output:
Seaborn
 Seaborn is a Python data visualization library based on Matplotlib.

 Its primary purpose is to provide a high-level interface for creating attractive and
informative statistical graphics.

 Seaborn is particularly well-suited for visualizing complex datasets and statistical
relationships, making it a popular choice for data exploration and presentation in data
analysis and machine learning tasks.
Purpose of Seaborn

 Statistical Visualization: Seaborn simplifies the creation of complex statistical
visualizations, allowing you to explore the relationships between variables in your data.
It provides functions for creating informative plots like scatter plots, bar plots,
histograms, box plots, violin plots, and more.

 Aesthetically Pleasing Plots: Seaborn comes with a set of appealing color palettes
and themes that make it easy to create visually pleasing and publication-quality plots
with minimal customization.

 Faceted Data Exploration: Seaborn provides built-in support for faceted data
exploration, allowing you to create multi-plot grids to examine interactions between
variables more easily.
import numpy as np
import seaborn as sns
sns.set(style="white")
rs = np.random.RandomState(10)
d = rs.normal(size=100)
sns.histplot(d, kde=True, color="m")

Output : histogram with seaborn


Astype
 The astype method in Pandas is used to change the data type of one or more columns
in a DataFrame.

 Its primary purpose is to allow you to explicitly specify the data type for columns in your
DataFrame.

 It can be useful for data type conversion, data manipulation, and data analysis tasks.
Purpose of astype

 Data Type Conversion: The primary purpose of astype is to change the data type of
one or more columns in a DataFrame. This can be helpful when you need to ensure
that a column has the correct data type for your analysis or when you want to convert
data from one type to another (e.g., from a string to a numeric type).
 Memory Optimization: By changing data types, you can reduce the memory usage of
your DataFrame. For example, converting integer columns to smaller integer types or
using float32 instead of float64 can lead to significant memory savings, which can be
crucial when dealing with large datasets.
 Data Cleaning: It can be used to clean data by converting erroneous or inconsistent
data into the correct data type. For example, converting string representations of
numbers into actual numeric types.
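The memory saving can be checked directly with DataFrame.memory_usage. This is a small sketch with made-up column names (count, ratio); it assumes the values fit in the smaller types, which should be verified before downcasting real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),   # values 0..999 fit in int16
    "ratio": np.linspace(0.0, 1.0, 1000),      # float64 by default
})

# Bytes used by the column data, excluding the index
before = df.memory_usage(index=False).sum()

# astype also accepts a dict mapping column names to target types
small = df.astype({"count": "int16", "ratio": "float32"})
after = small.memory_usage(index=False).sum()

print("bytes before:", before)
print("bytes after:", after)
```

Note that float32 keeps fewer significant digits than float64, so the saving comes at the cost of some precision.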
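import pandas as pd
# Column A holds numbers stored as strings; column B holds floats
data = {'A': ['1', '2', '3'], 'B': [4.1, 5.2, 6.3]}
df = pd.DataFrame(data)
print("Initial Data Types:")
print(df.dtypes)
# Convert A from string to integer and B from float64 to float32
df['A'] = df['A'].astype(int)
df['B'] = df['B'].astype('float32')
print("\nUpdated Data Types:")
print(df.dtypes)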

Output :
cat.codes
 In Pandas, the cat.codes attribute is used to obtain the category codes (or integer
codes) of the values in a categorical or "category" data type column.

 The purpose of cat.codes is to provide a way to represent categorical data as integers,
which can be useful for various data manipulation and analysis tasks.
Purpose of cat.codes

 Numerical Representation: cat.codes allows you to represent categorical data as
integers, making it easier to work with such data in numerical operations and analyses.
This can be especially valuable when you need to use categorical data in machine
learning models that require numerical inputs.
 Memory Efficiency: Integer representations are more memory-efficient compared to
storing the actual categories as strings or objects, which can be important when dealing
with large datasets.
 Sorting and Grouping: You can use the integer codes for sorting and grouping data.
For instance, you can sort a DataFrame by a categorical column based on its codes.
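Each code is the position of the value in cat.categories, so the mapping back to labels can be recovered from that attribute. The sketch below uses a small made-up Series and also shows that a missing value receives the code -1.

```python
import numpy as np
import pandas as pd

# A categorical Series with one missing value
s = pd.Series(["B", "A", np.nan, "C", "A"], dtype="category")

# Code -> category label; the categories are inferred in sorted order here
mapping = dict(enumerate(s.cat.categories))
codes = s.cat.codes.tolist()

print(mapping)
print(codes)
```

Because -1 marks a missing value rather than a real category, it is worth handling missing data before passing the codes to a model.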
import pandas as pd
data = {'Category': ['A', 'B', 'C', 'A', 'C', 'B', 'B']}
df = pd.DataFrame(data)
df['Category'] = df['Category'].astype('category')
category_codes = df['Category'].cat.codes
df['Category Codes'] = category_codes
print(df)

Output :
Classification Report
 The classification report is a tool in machine learning for evaluating the performance of
a classification model. Its primary purpose is to provide a detailed summary of the
model's performance in terms of various evaluation metrics for each class or category in
a classification problem.
 It's particularly useful for understanding how well a model is performing across different
classes and can help in identifying where the model may be making mistakes.
 The classification report typically includes metrics such as precision, recall, F1-score,
and support for each class, along with an overall accuracy score. These metrics are
valuable for assessing a model's performance in tasks like binary classification,
multi-class classification, and multi-label classification.
Purpose of Classification Report

 Detailed Evaluation: The classification report provides a detailed breakdown of model
performance, helping you understand how well the model is performing for each class.
This is especially important in imbalanced datasets where some classes may have
fewer examples.
 Model Comparison: It allows you to compare the performance of different models or
algorithms on the same dataset. You can use the report to choose the model that best
suits your specific classification task.
 Error Analysis: The report helps in identifying which classes the model is good at
predicting and which ones it struggles with, which can guide further data collection,
preprocessing, or model improvement efforts.
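The per-class numbers in the report can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below does this for class 1 of a small made-up binary problem and compares the result with scikit-learn's precision_recall_fscore_support.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels (illustrative only): 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Counts for class 1
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# The same metrics from scikit-learn, restricted to class 1
p, r, f, support = precision_recall_fscore_support(y_true, y_pred, labels=[1])
print(precision, recall, f1)
print(p[0], r[0], f[0], support[0])
```

Support is simply the number of true samples of the class, which is why it matters when reading scores for rare classes.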
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:")
print(report)

Output :
Standard Scaler
 The Standard Scaler is a preprocessing technique commonly used in machine learning
to scale and center numerical features (variables) in a dataset.
 Its primary purpose is to transform the features so that they have a mean of 0 and a
standard deviation of 1.
 Standardization (also known as z-score normalization) is particularly useful when
dealing with features that have different scales or units because it ensures that all
features have the same scale.
 It can improve the performance of many machine learning algorithms, especially those
sensitive to the scale of features.
Purpose of Standard Scaler

 Scale Features: Standard Scaler scales each feature independently, transforming them
to have a mean of 0 and a standard deviation of 1. This scaling makes it easier to
compare and interpret the impact of different features on a model.
 Improve Model Performance: Many machine learning algorithms, such as support
vector machines, k-nearest neighbors and principal component analysis, perform better
when features are standardized. Standardization helps prevent features with larger
scales from dominating the learning process.
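 Center and Rescale Distributions: Standard Scaler shifts and rescales each feature
but does not change the shape of its distribution, so skewed data stays skewed. Models
that assume normally distributed inputs may still need a separate transformation, such
as a log or power transform.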
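The transformation is the z-score, z = (x - mean) / std, computed per feature. The sketch below checks this against StandardScaler on a small array and applies the statistics learned from the training rows to a new row; the train/test split shown is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0, 5.0], [20.0, 10.0], [30.0, 15.0]])
test = np.array([[40.0, 20.0]])

# Learn the per-feature mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)

# Manual z-score; NumPy's default std is the population std (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
scaled_train = scaler.transform(train)

# New data is scaled with the training statistics, not its own
scaled_test = scaler.transform(test)

print(scaled_train)
print(scaled_test)
```

Fitting on the training data and reusing the scaler on test data keeps information from the test set out of the preprocessing step.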
import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([[10.0, 5.0], [20.0, 10.0], [30.0, 15.0]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Original Data:")
print(data)
print("\nScaled Data:")
print(scaled_data)

Output :
Label Encoding
 Label encoding is a technique used to convert categorical data (data that consists of
labels or categories) into numerical format.

 Its primary purpose is to prepare categorical data for machine learning algorithms that
require numerical inputs.

 Label encoding assigns a unique integer value to each category, effectively converting
them into numeric labels.
Purpose of Label Encoding

 Numeric Representation: Label encoding converts categorical data into a numeric
format, which is essential for many machine learning algorithms that work with
numerical data.

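 Preserve Ordinal Information: Label encoding can be useful for ordinal categorical
data, where the order or ranking among categories matters, but only if the assigned
integers follow that order. scikit-learn's LabelEncoder assigns integers in sorted
(alphabetical) order of the labels, so the ranking is retained only when it happens to
match that order; otherwise the mapping must be specified explicitly.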
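For ordinal data it helps to check which integer each category actually receives. The sketch below compares LabelEncoder, which numbers labels in sorted order, with OrdinalEncoder given an explicit category order; the low/medium/high values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ["low", "high", "medium", "low"]

# LabelEncoder sorts the labels: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder with an explicit order: low=0, medium=1, high=2
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered = encoder.fit_transform([[s] for s in sizes]).ravel()  # expects 2D input

print(alphabetical)
print(ordered)
```

Only the second encoding respects low < medium < high, which matters for models that treat the integers as magnitudes.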
from sklearn.preprocessing import LabelEncoder
data = ['red', 'green', 'blue', 'green', 'red']
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print("Original Data:")
print(data)
print("\nEncoded Data:")
print(encoded_data)
decoded_data = label_encoder.inverse_transform(encoded_data)
print("\nDecoded Data:")
print(decoded_data)

Output :
