Machine Learning
Industrial Training Report
Submitted for the degree of
Bachelor of Technology
By
HRJEET SINGH
2017-2021
Sponsored by
Internshala
Contents
Declaration
Certificate
Acknowledgement
Abstract
1.0 Introduction
6.0 Conclusion
References
Declaration
Roll No.: 1700410019
Date:
CERTIFICATE
Hrjeet Singh
Roll No.: 1700410019
Abstract
Industrial training is an important phase of a student's life. A well
planned, properly executed and evaluated industrial training helps a
lot in developing a professional attitude. It develops an awareness of
the industrial approach to problem solving, based on a broad
understanding of the processes and modes of operation of an organization.
The aim and motivation of this industrial training is to acquire
discipline, skills, teamwork and technical knowledge through a
proper training environment, which will help me, as a student in the
field of Information Technology, to develop an awareness of the
self-disciplinary nature of problems in information and
communication technology.
Company Background & structure
Company profile
Internshala was created with a mission to create skilled software engineers
for our country and the world. It aims to bridge the gap between the quality
of skills demanded by industry and the quality of skills imparted by
conventional institutes. With assessments, learning paths and courses
authored by industry experts, Internshala helps businesses and individuals
benchmark expertise across roles, speed up release cycles and build
reliable, secure products.
VISION
We are a technology company on a mission to equip students with relevant
skills and practical exposure through internships and online trainings.
Imagine a world full of freedom and possibilities. A world where you can
discover your passion and turn it into your career. A world where your
practical skills matter more than your university degree. A world where you
do not have to wait till 21 to taste your first work experience (and get a
rude shock that it is nothing like you had imagined it to be). A world where
you graduate fully assured, fully confident, and fully prepared to stake
your claim on your place in the world.
History
The platform, which was founded in 2010, started out as a WordPress blog
that aggregated internships across India and articles on education, technology
and skill gap. Internshala launched its online trainings in 2014. As of 2018,
the platform had 3.5 million students and 80,000 companies.
Mission
Weekly Summary
Supervised Learning
Supervised learning is a type of learning in which we are given a
data set and we already know what the correct output should look
like, having the idea that there is a relationship between the input
and the output. Basically, it is the task of learning a function that
maps an input to an output based on example input-output pairs.
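The idea of learning a mapping from example input-output pairs can be sketched in plain Python. The data and the fitted line below are hypothetical, chosen only to illustrate the concept; real work would use a library such as scikit-learn.

```python
# Illustrative sketch: supervised learning as fitting a function from
# known input-output pairs (hypothetical data, plain-Python least squares).

def fit_line(xs, ys):
    """Fit y = a0 + a1*x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    a0 = mean_y - a1 * mean_x
    return a0, a1

# The known input-output pairs are the "supervision".
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                 # underlying rule: y = 2x + 1
a0, a1 = fit_line(xs, ys)
print(round(a0, 2), round(a1, 2)) # learned intercept and slope: 1.0 2.0
```

The algorithm never sees the rule y = 2x + 1 directly; it recovers it purely from the labeled examples.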
Unsupervised Learning
Unsupervised learning is a type of learning that allows us to
approach problems with little or no idea of what our results should
look like. We can derive structure by clustering the data based on
relationships among the variables in the data. With unsupervised
learning there is no feedback based on the prediction results. Basically,
it is a type of self-organized learning that helps in finding previously
unknown patterns in a data set without pre-existing labels.
Data
Data is a collection of information about anything,
e.g. notifications, activity over time, clock alarms, etc.
Two types of data are used in machine learning models:
1. Labeled data
2. Unlabeled data
Labeled data
The data which contains a target variable or an output variable that
answers a question of interest is called labeled data.
Unlabeled data
Unlabeled data is a designation for pieces of data that have not been
tagged with labels identifying characteristics, properties or
classifications.
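The difference between the two types can be shown with a small sketch. The records and the "spam" question below are hypothetical, used only to make the distinction concrete.

```python
# Illustrative sketch (hypothetical records): the same features with and
# without a target variable answering a question of interest.

# Labeled data: each record carries an output variable ("is this spam?").
labeled = [
    {"words": 120, "links": 4, "spam": True},
    {"words": 310, "links": 0, "spam": False},
]

# Unlabeled data: the same kinds of features, but no target to learn from.
unlabeled = [
    {"words": 95, "links": 2},
    {"words": 400, "links": 1},
]

targets = [row["spam"] for row in labeled]
print(targets)   # the labels a supervised model would learn to predict
```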
Introduction to Python
Python is a widely used general-purpose, high-level programming
language. It was initially designed by Guido van Rossum and first
released in 1991, and is now developed by the Python Software
Foundation. It was mainly developed for an emphasis on code
readability, and its syntax allows programmers to express concepts
in fewer lines of code. Python is dynamically typed and
garbage-collected. It supports multiple programming paradigms,
including procedural, object-oriented, and functional programming.
Python is often described as a "batteries included" language due to
its comprehensive standard library.
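A short sketch of what "batteries included" and concise syntax mean in practice; the marks list is hypothetical, and everything used here ships with the standard library.

```python
# A small taste of Python's "batteries included" standard library and
# concise syntax: no third-party installs needed for these tasks.
import statistics
from collections import Counter

marks = [72, 85, 85, 90, 64]      # hypothetical data

# Dynamic typing and list comprehensions keep the code short.
passed = [m for m in marks if m >= 70]
print(statistics.mean(marks))          # arithmetic mean from the stdlib
print(Counter(marks).most_common(1))   # most frequent mark and its count
```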
3. Mean Removal - We can remove the mean from each feature to center it
on zero.
5. One Hot Encoding - When dealing with few and scattered numerical
values, we may not need to store these as-is. Instead, we can perform
One Hot Encoding: for k distinct values, we transform the feature into a
k-dimensional vector with a single value of 1 and 0 for the rest.
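Both preprocessing steps can be sketched in plain Python on hypothetical data; libraries such as scikit-learn provide the same transforms ready-made.

```python
# Mean removal: centre a feature on zero.
feature = [2.0, 4.0, 6.0]
mean = sum(feature) / len(feature)
centred = [x - mean for x in feature]
print(centred)           # [-2.0, 0.0, 2.0]

# One Hot Encoding: map k distinct values to k-dimensional vectors.
values = ["red", "green", "blue", "green"]
categories = sorted(set(values))          # fix an ordering of the k values
encoded = [[1 if v == c else 0 for c in categories] for v in values]
print(encoded[1])        # "green" -> exactly one 1, rest 0
```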
To achieve this level of certainty, here’s what you can do with EDA:
Learn about the individual features and their mutual relationships (or
lack thereof)
Check and validate the data for anomalies, outliers, missing values,
human errors, etc.
Y = a0 + a1·X
Here:
Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0, a1 = coefficients of the line (intercept and slope)
o The different values for the weights or coefficients of the line (a0, a1)
give different lines of regression, and the cost function is used to
estimate the values of the coefficients for the best-fit line.
o Cost function optimizes the regression coefficients or weights. It
measures how a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping
function, which maps the input variable to the output variable. This
mapping function is also known as Hypothesis function.
MAE (Mean Absolute Error) represents the difference between the original
and predicted values, computed by averaging the absolute differences over
the data set.
MSE (Mean Squared Error) represents the difference between the original and
predicted values, computed by averaging the squared differences over the
data set.
RMSE (Root Mean Squared Error) is the square root of MSE.
R-squared (Coefficient of determination) represents how well the predicted
values fit compared to the original values. Its value ranges from 0 to 1
and can be interpreted as a percentage. The higher the value, the better
the model.
The above metrics can be expressed as:
MAE = (1/N) Σ |y_i − ŷ_i|
MSE = (1/N) Σ (y_i − ŷ_i)²
RMSE = √MSE
R² = 1 − [Σ (y_i − ŷ_i)²] / [Σ (y_i − ȳ)²]
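The four metrics can be computed directly from their definitions; the actual/predicted values below are hypothetical.

```python
# MAE, MSE, RMSE and R-squared computed from their definitions
# in plain Python (hypothetical actual/predicted values).
import math

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
n = len(actual)

mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
mse  = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(mse)

mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot        # close to 1: the fit is good

print(mae, mse, round(rmse, 4), round(r2, 4))
```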
Gradient Descent:
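Gradient descent finds the regression coefficients by repeatedly stepping against the gradient of the MSE cost. A minimal sketch for the line y = a0 + a1·x follows; the data, learning rate and iteration count are hypothetical choices for illustration.

```python
# Minimal gradient descent sketch for y = a0 + a1*x against the MSE cost
# (hypothetical data and learning rate).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # generated from y = 2x + 1
a0, a1 = 0.0, 0.0                # initial coefficients
lr = 0.05                        # learning rate
n = len(xs)

for _ in range(2000):
    preds = [a0 + a1 * x for x in xs]
    # Gradients of the MSE cost with respect to a0 and a1.
    grad_a0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    grad_a1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(round(a0, 2), round(a1, 2))   # approaches the true 1 and 2
```

Too large a learning rate makes the updates diverge; too small a rate makes convergence slow, which is why it is treated as a tuning parameter.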
Model Performance:
The goodness of fit determines how well the line of regression fits the
set of observations. The process of finding the best model out of various
models is called optimization. It can be achieved by the method below:
1. R-squared method:
Below are some important assumptions of Linear Regression. These are some
formal checks to make while building a Linear Regression model, which ensure
that we get the best possible result from the given dataset.
Missing Value Ratio: If a dataset has too many missing values, then we drop
those variables as they do not carry much useful information. To perform this,
we can set a threshold level, and if a variable has missing values more than
that threshold, we will drop that variable. The higher the threshold value,
the more efficient the reduction.
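The missing value ratio check can be sketched as a simple threshold filter; the columns, missing counts and threshold below are hypothetical.

```python
# Missing Value Ratio sketch: drop any column whose share of missing
# values (None) exceeds a threshold (hypothetical data and threshold).

dataset = {
    "age":    [25, None, 31, None, None],   # 60% missing
    "salary": [40, 52, None, 47, 50],       # 20% missing
}
threshold = 0.5

kept = {
    name: col
    for name, col in dataset.items()
    if col.count(None) / len(col) <= threshold
}
print(sorted(kept))   # only columns under the missing-value threshold
```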
Low Variance Filter: As with the missing value ratio technique, data columns
with very little variation in the data carry little information. Therefore,
we calculate the variance of each variable, and all data columns with
variance lower than a given threshold are dropped, because low-variance
features will not affect the target variable.
High Correlation Filter: High correlation refers to the case when two
variables carry approximately the same information. Due to this factor, the
performance of the model can be degraded. The correlation between
independent numerical variables gives the calculated value of the
correlation coefficient. If this value is higher than the threshold value,
we can remove one of the two variables from the dataset. We should, however,
retain variables or features that show a high correlation with the target
variable.
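Both filters reduce to computing a statistic per column and comparing it to a threshold. A plain-Python sketch with hypothetical columns:

```python
# Sketch of the Low Variance and High Correlation filters
# (hypothetical columns; thresholds would be chosen per dataset).
import math

def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]   # x2 = 2*x1: carries the same information
x3 = [5.0, 5.0, 5.0, 5.0]   # zero variance: carries no information

print(variance(x3))                   # 0.0 -> dropped by low variance filter
print(round(correlation(x1, x2), 4))  # 1.0 -> drop one of the pair
```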
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing
a Linear Regression or Logistic Regression model. The steps below are
performed in this technique to reduce the dimensionality or for feature
selection:
o In this technique, firstly, all the n variables of the given dataset are taken
to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1
features for n times, and will compute the performance of the model.
o We will check which variable has made the smallest or no change in
the performance of the model, and then we will drop that variable or
feature; after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.
Forward Feature Selection
o We start with a single feature only, and progressively add one
feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the
performance of the model.
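The elimination loop above can be sketched with a stand-in scoring function; in practice the score would be model performance on held-out data, but here it is a hypothetical lookup chosen so that feature "c" contributes nothing.

```python
# Backward feature elimination sketch with a hypothetical scoring
# function (a real run would retrain a model to score each subset).

def score(features):
    # Hypothetical: "a" and "b" are useful, "c" adds nothing.
    useful = {"a": 0.4, "b": 0.3, "c": 0.0}
    return sum(useful[f] for f in features)

features = ["a", "b", "c"]          # start with all n variables
while len(features) > 1:
    base = score(features)
    # Try removing each feature in turn; find the least damaging removal.
    candidate = min(
        features,
        key=lambda f: base - score([g for g in features if g != f]),
    )
    if base - score([g for g in features if g != candidate]) > 0:
        break                        # every removal hurts: stop
    features.remove(candidate)       # drop the harmless feature

print(sorted(features))              # "c" has been eliminated
```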
Logistic regression
z = mx + c
g(z) = 1 / (1 + e^(−z))
ŷ = g(z)
ŷ = 1 / (1 + e^(−(mx + c)))
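In code, the sigmoid squashes the linear score z = mx + c into a probability in (0, 1); the slope and intercept below are hypothetical coefficients for illustration.

```python
# Logistic regression prediction: sigmoid of a linear score
# (hypothetical coefficients m and c).
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(x, m=2.0, c=-1.0):   # hypothetical coefficients
    return sigmoid(m * x + c)

print(sigmoid(0))                  # 0.5: the decision boundary
print(round(predict_proba(2), 4))  # large positive z -> close to 1
```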
The confusion matrix compares the predicted outcome (Positive or
Negative) with the actual value:
o True Negative: The model has predicted No, and the real or
actual value was also No.
o True Positive: The model has predicted Yes, and the actual
value was also Yes.
o False Negative: The model has predicted No, but the actual
value was Yes; it is also called a Type-II error.
o False Positive: The model has predicted Yes, but the actual
value was No; it is also called a Type-I error.
Precision = TP / (TP + FP)
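The four counts and precision can be computed directly from label lists; the actual/predicted labels below are hypothetical.

```python
# Confusion-matrix counts and precision from hypothetical
# actual/predicted labels ("yes"/"no").

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
tn = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "no")
fp = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "yes")
fn = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")

precision = tp / (tp + fp)         # of all predicted Yes, how many were right
print(tp, tn, fp, fn, round(precision, 3))
```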
In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.
While implementing a Decision tree, the main issue is how to select
the best attribute for the root node and for the sub-nodes. To solve
such problems there is a technique called the Attribute Selection
Measure, or ASM. With this measure, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques
for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
Information Gain = Entropy(S) − [(weighted average) × Entropy(each feature)]
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where:
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
Gini Index = 1 − Σ_j P_j²
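Both impurity measures can be computed for a binary node from the class probability alone; the probabilities below are hypothetical examples.

```python
# Entropy and Gini Index for a binary node, from the probability of
# the "yes" class (hypothetical probabilities).
import math

def entropy(p_yes):
    p_no = 1 - p_yes
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:                      # 0 * log2(0) is taken as 0
            total -= p * math.log2(p)
    return total

def gini(p_yes):
    p_no = 1 - p_yes
    return 1 - (p_yes ** 2 + p_no ** 2)

print(entropy(0.5), gini(0.5))   # maximum impurity: 1.0 and 0.5
print(entropy(1.0), gini(1.0))   # pure node: 0.0 and 0.0
```

A split with high information gain (large drop in entropy) or low Gini Index is preferred when choosing an attribute.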
Ensemble method
An ensemble method is a machine learning technique that combines
several base models in order to produce one optimal predictive
model. To better understand this definition, let's take a step back to
the ultimate goal of machine learning and model building. This will
make more sense as we dive into specific examples of why ensemble
methods are used.
Random Forest
The random forest algorithm works in the following steps:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data
points (subsets).
Step-3: Choose the number N of decision trees that you want to
build.
Step-4: Repeat Steps 1 and 2 until N trees are built.
Step-5: For new data points, find the predictions of each decision
tree, and assign the new data points to the category that wins the
majority of votes.
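The subset selection and the majority vote can be sketched in plain Python; the training indices and the per-tree predictions below are hypothetical (a real run would train actual decision trees on each subset).

```python
# Sketch of random-subset selection and majority voting
# (hypothetical training data and made-up tree predictions).
import random
from collections import Counter

random.seed(0)
training = list(range(10))

# Select K random data points from the training set (here K = 5).
subset = random.sample(training, k=5)

# Each decision tree predicts; the category with the most votes wins.
def majority_vote(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]

votes = ["cat", "dog", "cat", "cat", "dog"]   # N = 5 hypothetical trees
print(majority_vote(votes))                   # "cat" wins 3-2
```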
Clustering
Clustering or cluster analysis is a machine learning technique, which
groups the unlabeled dataset. It can be defined as "A way of
grouping the data points into different clusters, consisting of
similar data points. The objects with the possible similarities
remain in a group that has less or no similarities with another
group."
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is
used to solve the clustering problems in machine learning or data
science. In this topic, we will learn what is K-means clustering
algorithm, how the algorithm works, along with the Python
implementation of k-means clustering.
It is an iterative algorithm that divides the unlabeled dataset into k
different clusters in such a way that each data point belongs to only
one group with similar properties.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids.
Step-3: Assign each data point to its closest centroid, which will
form the predefined K clusters.
Step-4: Calculate a new centroid (the mean) of each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the
new closest centroid of each cluster.
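The loop of assigning points and recomputing centroids can be sketched in one dimension; the data, K = 2 and the initial centroids below are hypothetical, and real use would rely on a library such as scikit-learn.

```python
# Minimal 1-D K-Means sketch (hypothetical data, K = 2).

data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [1.0, 10.0]           # initial centroids

for _ in range(10):               # repeat until assignments stabilise
    # Assign each point to its closest centroid.
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Recompute each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)    # [1.5, 10.5]: one centroid per natural group
```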
References
Analytics Vidhya: Learn machine learning, artificial intelligence, business
analytics, data science, big data, and data visualization tools and
techniques.