
SKILL ORIENTED PROGRAMMING ON

MACHINE LEARNING
Submitted to JNTUA in partial fulfillment of the requirement
For the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted By
B.SRINIVAS ROYAL
(192H1A0514)

AUDISANKARA
INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
(Accredited by NAAC | Approved by AICTE | Affiliated to JNTUA)
NH-5, BYPASS ROAD, GUDUR-524101, TIRUPATI (DT.), ANDHRA PRADESH.
2022-2023
ACKNOWLEDGEMENT
First and foremost, I would like to thank my beloved parents for their blessings
and grace in making this skill oriented programming a success. I avail this
opportunity to express my profound sense of sincere and deep gratitude to
those who constantly guided, supported and encouraged me during the course of my
skill oriented programming.
I wish to express my heartfelt thanks and deep sense of gratitude to the
honorable chairman Dr. V. PENCHALAIAH for his encouragement and inspiration
throughout the process.
I would like to thank my beloved Director of “AUDISANKARA INSTITUTE OF
TECHNOLOGY”, Dr. A. MOHAN, for creating a competitive environment in our
college and encouraging me throughout this course.
I would like to thank my college management for having allowed me to do the
project work. Lastly, I would like to pay my regards and thank our principal
Dr. T. VENU MADHAV, whose ideas proved to be really worthwhile in my work.
I wish to express my deep sense of gratitude to my beloved and esteemed Head
of the Department of CSE, Dr. A. SWARUPA RANI, Assoc. Professor, for her
support, encouragement and valuable suggestions, which went a long way in the
successful completion of this skill oriented programming.
DECLARATION
I hereby declare that the skill oriented programming entitled “MACHINE
LEARNING” has been successfully completed. This skill oriented programming work
has been submitted to “AUDISANKARA INSTITUTE OF TECHNOLOGY”, GUDUR,
in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology. I also declare that this skill oriented programming
report has not been submitted at any time to any other institute or university for
the award of any degree.

B.SRINIVAS ROYAL
(192H1A0514)

PLACE: GUDUR,
DATE:
INDEX

SL.NO   NAME OF THE TOPIC
1       INTRODUCTION TO MACHINE LEARNING
2       CLASSIFICATION OF MACHINE LEARNING
3       STRUCTURE OF MACHINE LEARNING
4       CLASSIFICATION ALGORITHM IN ML
5       LOGISTIC REGRESSION IN ML
6       CLUSTERING IN ML
7       CLUSTERING ALGORITHMS
8       DATA PROCESSING
9       INTRODUCTION TO DIMENSIONALITY REDUCTION
10      STEP BY STEP IMPLEMENTATION IN PYTHON
1. INTRODUCTION TO MACHINE LEARNING
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959 while at
IBM. He defined machine learning as “the field of study that gives computers the
ability to learn without being explicitly programmed”.
➢ The field of study known as machine learning is concerned with the
question of how to construct computer programs that automatically
improve with experience.

DEFINITION OF LEARNING:
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.

EXAMPLES:
Handwriting learning problem
Task T: Recognizing and classifying handwritten words within images
Performance P: Percent of words correctly classified
Experience E: A dataset of handwritten words with given classifications

A robot driving learning problem
Task T: Driving on highways using vision sensors
Performance P: Average distance traveled before an error
Experience E: A sequence of images and steering commands recorded
while observing a human driver

DEFINITION:
➢ A computer program which learns from experience is called a machine
learning program or simply a learning program.
2. CLASSIFICATION OF MACHINE LEARNING
Machine learning implementations are classified into four major categories,
depending on the nature of the learning “signal” or “response” available to the
learning system, which are as follows:
SUPERVISED LEARNING
Supervised learning is the machine learning task of learning a function that maps
an input to an output based on example input-output pairs. The given data is
labeled. Both classification and regression problems are supervised learning
problems.
Example: Consider the following data regarding patients entering a clinic. The
data consists of the gender and age of the patients, and each patient is labeled as
“healthy” or “sick”.

Gender  Age  Label
M       49   sick
M       67   sick
F       53   healthy
M       49   sick
F       32   healthy
M       34   healthy
M       21   healthy
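As an illustration only (this code is not part of the original report), a minimal
scikit-learn sketch of supervised learning on the labeled table above might look
like this; encoding gender as 0 (M) / 1 (F) is an assumption made for the example:

# Hedged sketch: a classifier trained on the labeled clinic table above.
from sklearn.tree import DecisionTreeClassifier

# Each row is [gender, age], with gender encoded as 0 (M) / 1 (F).
X = [[0, 49], [0, 67], [1, 53], [0, 49], [1, 32], [0, 34], [0, 21]]
y = ["sick", "sick", "healthy", "sick", "healthy", "healthy", "healthy"]

clf = DecisionTreeClassifier().fit(X, y)   # learn the input-output mapping
print(clf.predict([[0, 60]]))              # predict the label for a new patient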

UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses. In
unsupervised learning algorithms, classification or categorization is not included
in the observations.

Example: Consider the following data regarding patients entering a clinic. The
data consists of the gender and age of the patients.

Gender  Age
M       48
M       67
F       53
M       49
F       34
M       21
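A minimal sketch of unsupervised learning on this unlabeled table, assuming
k-means with two clusters (the choice of algorithm, cluster count, and the 0/1
gender encoding are illustrative):

from sklearn.cluster import KMeans
import numpy as np

# Unlabeled patient data: [gender (0 = M, 1 = F), age].
X = np.array([[0, 48], [0, 67], [1, 53], [0, 49], [1, 34], [0, 21]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # groups inferred without any labeled responses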
REINFORCEMENT LEARNING
Reinforcement learning is the problem of getting an agent to act in the world so as
to maximize its rewards.
A learner is not told what actions to take, as in most forms of machine learning, but
instead must discover which actions yield the most reward by trying them. For
example, consider teaching a dog a new trick: we cannot tell it what to do, but we
can reward or punish it if it does the right or wrong thing.
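A tiny self-contained sketch of this reward idea, using an epsilon-greedy bandit
with two hypothetical actions (the reward probabilities are made up for
illustration; this is not from the report):

import random

def reward(action):
    # Hypothetical environment: action 1 is rewarded more often than action 0.
    return 1 if random.random() < (0.8 if action == 1 else 0.2) else 0

values, counts = [0.0, 0.0], [0, 0]
for _ in range(1000):
    # Mostly exploit the best-known action, sometimes explore at random.
    a = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
    r = reward(a)
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # running average of observed reward
print(values)   # the learner discovers that action 1 yields the most reward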
SEMI-SUPERVISED LEARNING
Here an incomplete training signal is given: a training set with some (often many)
of the target outputs missing. There is a special case of this principle known as
Transduction, where the entire set of problem instances is known at learning time,
except that part of the targets are missing. Semi-supervised learning is an approach
to machine learning that combines a small amount of labeled data with a large
amount of unlabeled data during training. Semi-supervised learning falls between
unsupervised learning and supervised learning.

3. STRUCTURE OF MACHINE LEARNING
Structured machine learning refers to learning structured hypotheses from data
with rich internal structure, typically involving one or more relations. In general,
the data might encompass known inputs in addition to outputs, components of
which can be uncertain, noisy, or missing.

USES OF MACHINE LEARNING:


➢ Finance
➢ Health
➢ Government
➢ Stores
➢ Oil and gas
➢ Transport

(Figure: Structure of Machine Learning)

4. CLASSIFICATION ALGORITHM IN MACHINE LEARNING
The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and then
classifies new observations into a number of classes or groups, such as Yes or No, 0
or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or
categories. Unlike regression, the output variable of Classification is a category, not
a value, such as “Green or Blue”, “fruit or animal”, etc. Since the Classification
algorithm is a Supervised Learning technique, it takes labeled input data,
which means it contains input with the corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to an input
variable (x):

y = f(x), where y is a categorical output

The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for
categorical data.
Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, Class A and Class B. These classes have
features that are similar to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of Classifications:
Binary Classifier: If the classification problem has only two possible outcomes,
then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.

TYPES OF ML CLASSIFICATIONS:
Classification algorithms can be further divided into mainly two categories
(a brief side-by-side sketch follows this list):
➢ Linear Models
Logistic Regression
Support Vector Machines
➢ Non-linear Models
K-Nearest Neighbours
Kernel SVM
Naïve Bayes
Decision Tree Classification
Random Forest Classification
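As a hedged illustration, one linear and one non-linear model from the lists above
can be fitted side by side; the make_moons toy dataset is an assumption chosen for
the example, not from the report:

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# A toy two-class dataset that a linear model cannot separate perfectly.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for model in (LogisticRegression(), KNeighborsClassifier()):
    print(type(model).__name__, model.fit(X, y).score(X, y))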

5. LOGISTIC REGRESSION IN MACHINE LEARNING
Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting a
categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it
gives probabilistic values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an “S”-shaped
logistic function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its
weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
Logistic Regression can be used to classify observations using different types of
data and can easily determine the most effective variables used for the
classification. The image below shows the logistic function:

LOGISTIC FUNCTION:
• The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1 and cannot
go beyond this limit, so it forms a curve like the “S” form. The S-form curve
is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which
defines the probability of either 0 or 1: values above the threshold tend
to 1, and values below the threshold tend to 0. (A minimal sketch of this
follows the list.)
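A minimal NumPy sketch of the sigmoid function and the threshold idea described
above (the sample values of z are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real value into the range (0, 1)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print(p)                        # probabilistic values between 0 and 1
print((p >= 0.5).astype(int))   # threshold at 0.5: above tends to 1, below to 0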

ASSUMPTIONS:
• The dependent variable must be categorical in nature.
• The independent variables should not have multicollinearity.

LOGISTIC REGRESSION EQUATION:

The logistic regression equation can be obtained from the linear regression
equation. The mathematical steps to get the logistic regression equation are
given below:
➔ We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

➔ In logistic regression, y can be between 0 and 1 only, so we divide the
above equation by (1 - y):

y/(1 - y); 0 for y = 0, and infinity for y = 1

➔ But we need a range between -infinity and +infinity, so taking the logarithm
of the equation, it becomes:

log[y/(1 - y)] = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

TYPES OF LOGISTIC REGRESSION:

On the basis of the categories, Logistic Regression can be classified into three
types:
Binomial: In binomial Logistic Regression, there can be only two possible types of
the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered
types of the dependent variable, such as “low”, “Medium”, or “High”.
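A hedged scikit-learn sketch of the multinomial case; the iris dataset (three
unordered classes) is an illustrative choice, not from the report:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # three unordered classes
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))                   # one probability per class, summing to 1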

6. CLUSTERING IN MACHINE LEARNING
Clustering or cluster analysis is a machine learning technique which groups an
unlabelled dataset. It can be defined as “a way of grouping the data points into
different clusters, consisting of similar data points. The objects with possible
similarities remain in a group that has less or no similarity with another group.”
It does this by finding similar patterns in the unlabelled dataset, such as shape,
size, color, behavior, etc., and divides the data as per the presence and absence of
those patterns.
It is an unsupervised learning method, hence no supervision is provided to the
algorithm, and it deals with an unlabeled dataset.
After applying this clustering technique, each cluster or group is given a
cluster ID. An ML system can use this ID to simplify the processing of large and
complex datasets.
The clustering technique is commonly used for statistical data analysis.
The clustering technique can be widely used in various tasks. Some of the most
common uses of this technique are:
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations based on past product searches. Netflix
also uses this technique to recommend movies and web series to its users based
on their watch history.
The below diagram explains the working of the clustering algorithm. We can see
that different fruits are divided into several groups with similar properties.

TYPES OF CLUSTERING:
The clustering methods are broadly divided into Hard clustering (a data point
belongs to only one group) and Soft clustering (data points can belong to more
than one group). But various other approaches to clustering also exist. Below
are the main clustering methods used in Machine Learning:
• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines
the number of pre-defined groups. The cluster centers are created in such a way
that the distance between the data points of one cluster and its centroid is minimal
compared to the other cluster centroids.
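A minimal K-Means sketch of this centroid-based idea (the toy points and the
choice of k = 2 are assumptions for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Two well-separated blobs; K = 2 pre-defined groups.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_)            # each point assigned to its nearest centroid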

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped distributions are formed as long as the dense region can
be connected. The algorithm does this by identifying different clusters in the dataset
and connecting the areas of high density into clusters. The dense areas in the data
space are divided from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has
varying densities and high dimensions.
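A hedged DBSCAN sketch of this density-based idea (the toy points and the eps and
min_samples values are illustrative):

from sklearn.cluster import DBSCAN
import numpy as np

# Two dense areas plus one isolated point in a sparse region.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.0], [8.0, 8.1],
              [50.0, 50.0]])
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # dense areas become clusters; -1 marks points in sparse areas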
Distribution Model-Based Clustering:
In the distribution model-based clustering method, the data is divided based
on the probability of how a dataset belongs to a particular distribution. The
grouping is done by assuming some distributions, commonly the Gaussian
distribution. The example of this type is the Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models (GMM).
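A minimal GMM sketch of this distribution-based idea, with points drawn from two
assumed Gaussians (the toy data is an assumption for illustration):

from sklearn.mixture import GaussianMixture
import numpy as np

rng = np.random.default_rng(0)
# Points drawn from two assumed Gaussian distributions.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
print(gmm.predict(X[:5]))   # each point assigned to its most probable Gaussian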

Hierarchical Clustering:
Hierarchical clustering can be used as an alternative to partitioned
clustering, as there is no requirement to pre-specify the number of clusters
to be created. In this technique, the dataset is divided into clusters to create
a tree-like structure, which is also called a dendrogram. The observations, or
any number of clusters, can be selected by cutting the tree at the correct level.
The most common example of this method is the Agglomerative Hierarchical
algorithm.
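A hedged SciPy sketch of building a dendrogram and cutting it at a level (the toy
points and the Ward linkage choice are illustrative):

from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 1.0], [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
Z = linkage(X, method="ward")                    # build the dendrogram bottom-up
print(fcluster(Z, t=2, criterion="maxclust"))    # cut the tree into two clusters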

Fuzzy Clustering:
Fuzzy clustering is a type of soft method in which a data object may belong
to more than one group or cluster. Each data point has a set of membership
coefficients, which depend on its degree of membership in a cluster.
The Fuzzy C-Means algorithm is the example of this type of clustering; it is
sometimes also known as the Fuzzy k-means algorithm.

7. CLUSTERING ALGORITHMS

The clustering algorithms can be divided based on the models explained
above. There are many different clustering algorithms published, but only a few
are commonly used. The choice of clustering algorithm depends on the kind of
data that we are using: some algorithms need a guess of the number of clusters
in the given dataset, whereas others need to find the minimum distance between
the observations of the dataset.

Here we discuss the mainly popular clustering algorithms that are widely used in
machine learning (a short sketch of the two algorithms not illustrated above
follows this list):

K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variances. The number of clusters must be specified
in this algorithm. It is fast, with fewer computations required, with a linear
complexity of O(n).

Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in
a smooth density of data points. It is an example of a centroid-based model
that works by updating the candidates for centroids to be the center of the
points within a given region.

DBSCAN algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model, similar to
mean-shift, but with some remarkable advantages. In this algorithm, the
areas of high density are separated from the areas of low density. Because of
this, the clusters can be found in any arbitrary shape.

Expectation-Maximization clustering using GMM: This algorithm can be
used as an alternative to the k-means algorithm, or for cases where k-means
can fail. In GMM, it is assumed that the data points are Gaussian distributed.

Agglomerative Hierarchical algorithm: The agglomerative hierarchical
algorithm performs bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree structure.

Affinity Propagation: It is different from other clustering algorithms in that it
does not require specifying the number of clusters. In this, each data point
sends messages between pairs of data points until convergence. Its
O(N²T) time complexity is the main drawback of this algorithm.
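For the two algorithms above without an earlier sketch, a hedged Mean-Shift and
Affinity Propagation example (the toy points are assumptions for illustration):

from sklearn.cluster import MeanShift, AffinityPropagation
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [6.0, 6.0], [6.1, 5.9], [5.8, 6.2]])
print(MeanShift().fit(X).labels_)                          # centroids shift toward dense areas
print(AffinityPropagation(random_state=0).fit(X).labels_)  # no cluster count specified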

8. DATA PROCESSING
Data Processing is the task of converting data from a given form to a much
more usable and desired form, i.e., making it more meaningful and informative.
Using Machine Learning algorithms, mathematical modeling, and statistical
knowledge, this entire process can be automated. The output of this complete
process can be in any desired form, like graphs, videos, charts, tables, images,
and many more, depending on the task we are performing and the
requirements of the machine. This might seem simple, but when it comes
to massive organizations like Twitter and Facebook, administrative bodies like
Parliament and UNESCO, and health sector organizations, this entire process
needs to be performed in a very structured manner. The steps to perform
are as follows:

Collection:
The most crucial step when starting with ML is to have data of good quality and
accuracy. Data can be collected from any authenticated source,
like Kaggle or the UCI dataset repository. For example, while preparing for a
competitive exam, students study from the best study material that they can
access so that they learn the best and obtain the best results. In the same way,
high-quality and accurate data will make the learning process of the model
easier and better, and at testing time the model will yield state-of-the-art
results. A huge amount of capital, time and resources is consumed in
collecting data. Organizations or researchers have to decide what kind of data
they need to execute their tasks or research.
Example: Working on a facial expression recognizer needs numerous
images having a variety of human expressions. Good data ensures that the
results of the model are valid and can be trusted.

Preparation:
The collected data can be in a raw form which can’t be directly fed to the
machine. So, this is a process of collecting datasets from different sources,
analyzing these datasets and then constructing a new dataset for further
processing and exploration. This preparation can be performed either
manually or automatically. Data can also be prepared in numeric form,
which speeds up the model’s learning.
Example: An image can be converted to a matrix of N x N dimensions, where
the value of each cell indicates a pixel of the image.
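A minimal NumPy sketch of this image-to-matrix idea (the 4 x 4 size and the
random pixel values are assumptions for illustration):

import numpy as np

# A fake 4 x 4 grayscale "image" as an N x N matrix of pixel values.
image = np.random.randint(0, 256, size=(4, 4))   # each cell holds a pixel value
print(image)
print(image.astype(np.float32) / 255.0)          # numeric, scaled form for learning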

Input:
The prepared data may still be in a form that is not machine-readable,
so to convert this data into a readable form, some conversion algorithms are
needed. For this task to be executed, high computation power and accuracy are
needed.
Example: Data can be collected through sources like the MNIST digit
data (images), Twitter comments, audio files, and video clips.

Processing:
This is the stage where algorithms and ML techniques are required to perform
the instructions provided over a large volume of data with accuracy and
optimal computation.

Output:
In this stage, results are procured by the machine in a meaningful manner,
which can be easily interpreted by the user. Output can be in the form of
reports, graphs, videos, etc.

Storage:
This is the final step, in which the obtained output, the data model, and all
other useful information are saved for future use.

9. INTRODUCTION TO DIMENSIONALITY REDUCTION

In machine learning classification problems, there are often too many factors
on the basis of which the final classification is done. These factors are basically
variables called features. The higher the number of features, the harder it gets
to visualize the training set and then work on it. Sometimes, most of these
features are correlated, and hence redundant. This is where dimensionality
reduction algorithms come into play. Dimensionality reduction is the process
of reducing the number of random variables under consideration by obtaining
a set of principal variables. It can be divided into feature selection and feature
extraction.

Why is Dimensionality Reduction important in Machine Learning and
Predictive Modeling?

An intuitive example of dimensionality reduction can be discussed through a
simple e-mail classification problem, where we need to classify whether the
e-mail is spam or not. This can involve a large number of features, such as
whether or not the e-mail has a generic title, the content of the e-mail, whether
the e-mail uses a template, etc. However, some of these features may overlap.
In another case, a classification problem that relies on both humidity and
rainfall can be collapsed into just one underlying feature, since both of the
aforementioned are correlated to a high degree. Hence, we can reduce the
number of features in such problems. A 3-D classification problem can be hard
to visualize, whereas a 2-D one can be mapped to a simple 2-dimensional space,
and a 1-D problem to a simple line. The below figure illustrates this concept,
where a 3-D feature space is split into two 2-D feature spaces, and later, if
found to be correlated, the number of features can be reduced even further.

Components of Dimensionality Reduction

There are two components of dimensionality reduction:

Feature selection: In this, we try to find a subset of the original set of
variables, or features, to get a smaller subset which can be used to model the
problem. It usually involves three ways:

• Filter
• Wrapper
• Embedded

Feature extraction: This reduces the data in a high-dimensional space to a
lower-dimensional space, i.e. a space with a smaller number of dimensions.
(A side-by-side sketch of both components follows.)
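A hedged scikit-learn sketch contrasting the two components; the iris dataset and
the choice of two output features are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)  # selection: keep a subset of original features
X_ext = PCA(n_components=2).fit_transform(X)             # extraction: project onto new dimensions
print(X_sel.shape, X_ext.shape)                          # both end with 2 features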

Methods of Dimensionality Reduction:

The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be either linear or non-linear, depending upon
the method used. The prime linear method, called Principal Component
Analysis, or PCA, is discussed below.

Principal Component Analysis

This method was introduced by Karl Pearson. It works on the condition that
while the data in a higher-dimensional space is mapped to data in a
lower-dimensional space, the variance of the data in the lower-dimensional
space should be maximum.

It involves the following steps:

• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to
reconstruct a large fraction of the variance of the original data.

Hence, we are left with a smaller number of eigenvectors, and there might
have been some data loss in the process. But the most important variances
should be retained by the remaining eigenvectors.
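A hedged NumPy sketch of exactly these three steps (the toy data, whose variance
lies mostly along two of the three axes, is an assumption for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)                      # center the data first
cov = np.cov(Xc, rowvar=False)               # step 1: covariance matrix of the data
vals, vecs = np.linalg.eigh(cov)             # step 2: eigenvalues/eigenvectors
top2 = vecs[:, np.argsort(vals)[::-1][:2]]   # step 3: keep largest-eigenvalue vectors
print((Xc @ top2).shape)                     # data reduced to 2 principal components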

Advantages of Dimensionality Reduction:

• It helps in data compression, and hence reduces storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction:

• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is
sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define
datasets.
• We may not know how many principal components to keep; in practice,
some rules of thumb are applied.

Why do we prefer Python to implement machine learning algorithms?

Python is a popular and general-purpose programming language. We can
write machine learning algorithms in Python, and they work well. The reason
why Python is so popular among data scientists is that Python has a diverse
variety of modules and libraries already implemented that make our life more
comfortable.

Let us have a brief look at some exciting Python libraries.

NumPy: It is a math library to work with n-dimensional arrays in Python.
It enables us to do computations effectively and efficiently.

SciPy: It is a collection of numerical algorithms and domain-specific
toolboxes, including signal processing, optimization, statistics, and much
more. SciPy is a functional library for scientific and high-performance
computations.

Scikit-learn: It is a free machine learning library for the Python
programming language. It has most of the classification, regression, and
clustering algorithms, and works with Python numerical libraries such as
NumPy and SciPy.

Matplotlib: It is a trendy plotting package that provides 2D plotting as well
as 3D plotting.

10. STEP BY STEP IMPLEMENTATION IN PYTHON

Import required libraries:
Since we are going to use various libraries for calculations, we need to import
them.

Read the CSV file:
We check the first five rows of our dataset. In this case, we are using a vehicle
model dataset; please check out the dataset on Softlayer IBM.

Select the features we want to consider in predicting values:
Here our goal is to predict the value of “co2 emissions” from the value of
“engine size” in our dataset.

Plot the data:
We can visualize our data on a scatter plot.

Divide the data into training and testing data:
To check the accuracy of the model, we are going to divide our data into
training and testing datasets. We will use the training data to train our model,
and then we will check the accuracy of the model using the testing dataset.

Training our model:
Here is how we can train our model and find the coefficients for our best-fit
regression line.

Plot the best-fit line:
Based on the coefficients, we can plot the best-fit line for our dataset.

Prediction function:
We are going to use a prediction function for our testing dataset.

Predicting co2 emissions:
Predicting the values of co2 emissions based on the regression line.

Checking accuracy for test data:
We can check the accuracy of the model by comparing the actual values with
the predicted values in our dataset.
Put it all together

# Import required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Read the CSV file:
data = pd.read_csv("Fuel.csv")
data.head()

# Let's select some features to explore more:
data = data[["ENGINESIZE", "CO2EMISSIONS"]]

# ENGINESIZE vs CO2EMISSIONS:
plt.scatter(data["ENGINESIZE"], data["CO2EMISSIONS"], color="blue")
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

# Generating training and testing data from our data:
# We are using 80% of the data for training.
train = data[:int(len(data) * 0.8)]
test = data[int(len(data) * 0.8):]

# Modeling:
# Using the sklearn package to model the data:
regr = linear_model.LinearRegression()
train_x = np.array(train[["ENGINESIZE"]])
train_y = np.array(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)

# The coefficients:
print("Coefficients :", regr.coef_)    # slope
print("Intercept :", regr.intercept_)  # intercept

# Plotting the regression line:
plt.scatter(train["ENGINESIZE"], train["CO2EMISSIONS"], color="blue")
plt.plot(train_x, regr.coef_[0][0] * train_x + regr.intercept_[0], "-r")
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

# Predicting values:
# Function for predicting future values:
def get_regression_predictions(input_features, intercept, slope):
    predicted_values = input_features * slope + intercept
    return predicted_values

# Predicting emission for a future car:
my_engine_size = 3.5
estimated_emission = get_regression_predictions(
    my_engine_size, regr.intercept_[0], regr.coef_[0][0])
print("Estimated Emission :", estimated_emission)

# Checking various accuracy metrics:
from sklearn.metrics import r2_score

test_x = np.array(test[["ENGINESIZE"]])
test_y = np.array(test[["CO2EMISSIONS"]])
test_y_ = regr.predict(test_x)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))
