Machine Learning
Industrial Training Report
Submitted for the degree of
Bachelor of Technology
By
HRJEET SINGH
2017-2021
Sponsored by
Internshala
Contents
Declaration
Certificate
Acknowledgement
Abstract
1.0 Introduction
6.0 Conclusion
References
Declaration
Roll No.: 1700410019
Date:
CERTIFICATE
Hrjeet Singh
Roll No.: 1700410019
Abstract
Industrial training is an important phase of a student's life. A well
planned, properly executed and evaluated industrial training helps a
lot in developing a professional attitude. It develops an awareness of
the industrial approach to problem solving, based on a broad
understanding of the processes and modes of operation of an organization.
The aim and motivation of this industrial training is to acquire
discipline, skills, teamwork and technical knowledge through a
proper training environment, which will help me, as a student in the
field of Information Technology, to develop an awareness of the
self-disciplinary nature of problems in information and
communication technology.
Company Background & structure
Company profile
Internshala was created with a mission to create skilled software engineers
for our country and the world. It aims to bridge the gap between the quality
of skills demanded by industry and the quality of skills imparted by
conventional institutes. With assessments, learning paths and courses
authored by industry experts, Internshala helps businesses and individuals
benchmark expertise across roles, speed up release cycles and build
reliable, secure products.
VISION
We are a technology company on a mission to equip students with relevant
skills and practical exposure through internships and online trainings.
Imagine a world full of freedom and possibilities. A world where you can
discover your passion and turn it into your career. A world where your
practical skills matter more than your university degree. A world where you
do not have to wait till 21 to taste your first work experience (and get a
rude shock that it is nothing like you had imagined it to be). A world where
you graduate fully assured, fully confident, and fully prepared to stake
your claim on your place in the world.
History
The platform, which was founded in 2010, started out as a WordPress blog
that aggregated internships across India and articles on education, technology
and skill gap. Internshala launched its online trainings in 2014. As of 2018,
the platform had 3.5 million students and 80,000 companies.
Mission
Weekly Summary
Supervised Learning
Supervised learning is a type of learning in which we are given a
data set and we already know what the correct output should look
like, having the idea that there is a relationship between the input
and the output. Basically, it is the task of learning a function that
maps an input to an output based on example input-output pairs.
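The idea of learning a mapping from example input-output pairs can be sketched in plain Python. The data and the fitted line below are hypothetical, chosen only to illustrate the concept; real work would use a library such as scikit-learn.

```python
# Illustrative sketch: supervised learning as fitting a function from
# known input-output pairs (hypothetical data, plain-Python least squares).

def fit_line(xs, ys):
    """Fit y = a0 + a1*x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    a0 = mean_y - a1 * mean_x
    return a0, a1

# The known input-output pairs are the "supervision".
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                 # underlying rule: y = 2x + 1
a0, a1 = fit_line(xs, ys)
print(round(a0, 2), round(a1, 2)) # learned intercept and slope: 1.0 2.0
```

The algorithm never sees the rule y = 2x + 1 directly; it recovers it purely from the labeled examples.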
Unsupervised Learning
Unsupervised learning is a type of learning that allows us to
approach problems with little or no idea of what our results should
look like. We can derive structure by clustering the data based on
relationships among the variables in the data. With unsupervised
learning there is no feedback based on the prediction results. Basically,
it is a type of self-organized learning that helps in finding previously
unknown patterns in a data set without pre-existing labels.
Data
Data is a collection of information about anything,
e.g. notifications, activity over time, clock alarms, etc.
Two types of data are used in machine learning models:
1. Labeled data
2. Unlabeled data
Labeled data
The data which contains a target variable or an output variable that
answers a question of interest is called labeled data.
Unlabeled data
Unlabeled data is a designation for pieces of data that have not been
tagged with labels identifying characteristics, properties or
classifications.
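The difference between the two types can be shown with a small sketch. The records and the "spam" question below are hypothetical, used only to make the distinction concrete.

```python
# Illustrative sketch (hypothetical records): the same features with and
# without a target variable answering a question of interest.

# Labeled data: each record carries an output variable ("is this spam?").
labeled = [
    {"words": 120, "links": 4, "spam": True},
    {"words": 310, "links": 0, "spam": False},
]

# Unlabeled data: the same kinds of features, but no target to learn from.
unlabeled = [
    {"words": 95, "links": 2},
    {"words": 400, "links": 1},
]

targets = [row["spam"] for row in labeled]
print(targets)   # the labels a supervised model would learn to predict
```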
Introduction to Python
Python is a widely used general-purpose, high-level programming
language. It was initially designed by Guido van Rossum and first
released in 1991, and is now developed by the Python Software
Foundation. It was mainly developed for an emphasis on code
readability, and its syntax allows programmers to express concepts
in fewer lines of code. Python is dynamically typed and
garbage-collected. It supports multiple programming paradigms,
including procedural, object-oriented, and functional programming.
Python is often described as a "batteries included" language due to
its comprehensive standard library.
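A short sketch of what "batteries included" and concise syntax mean in practice; the marks list is hypothetical, and everything used here ships with the standard library.

```python
# A small taste of Python's "batteries included" standard library and
# concise syntax: no third-party installs needed for these tasks.
import statistics
from collections import Counter

marks = [72, 85, 85, 90, 64]      # hypothetical data

# Dynamic typing and list comprehensions keep the code short.
passed = [m for m in marks if m >= 70]
print(statistics.mean(marks))          # arithmetic mean from the stdlib
print(Counter(marks).most_common(1))   # most frequent mark and its count
```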
3. Mean Removal - We can remove the mean from each feature to center it
on zero.
5. One Hot Encoding - When dealing with few and scattered numerical
values, we may not need to store these as-is. Instead, we can perform
One Hot Encoding: for k distinct values, we transform the feature into a
k-dimensional vector with a single value of 1 and 0 for the rest.
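Both preprocessing steps can be sketched in plain Python on hypothetical data; libraries such as scikit-learn provide the same transforms ready-made.

```python
# Mean removal: centre a feature on zero.
feature = [2.0, 4.0, 6.0]
mean = sum(feature) / len(feature)
centred = [x - mean for x in feature]
print(centred)           # [-2.0, 0.0, 2.0]

# One Hot Encoding: map k distinct values to k-dimensional vectors.
values = ["red", "green", "blue", "green"]
categories = sorted(set(values))          # fix an ordering of the k values
encoded = [[1 if v == c else 0 for c in categories] for v in values]
print(encoded[1])        # "green" -> exactly one 1, rest 0
```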
To achieve this level of certainty, here’s what you can do with EDA:
Learn about the individual features and their mutual relationships (or
lack thereof)
Check and validate the data for anomalies, outliers, missing values,
human errors, etc.
Y = a0 + a1·X
Here:
Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0, a1 = coefficients of the line (intercept and slope)
o The different values for the weights or coefficients of the line (a0, a1)
give different lines of regression, and the cost function is used to
estimate the values of the coefficients for the best-fit line.
o Cost function optimizes the regression coefficients or weights. It
measures how a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping
function, which maps the input variable to the output variable. This
mapping function is also known as Hypothesis function.
MAE (Mean Absolute Error) represents the difference between the original
and predicted values, computed by averaging the absolute differences over
the data set.
MSE (Mean Squared Error) represents the difference between the original and
predicted values, computed by averaging the squared differences over the
data set.
RMSE (Root Mean Squared Error) is the square root of MSE.
R-squared (Coefficient of determination) represents how well the predicted
values fit compared to the original values. Its value ranges from 0 to 1
and can be interpreted as a percentage. The higher the value, the better
the model.
The above metrics can be expressed as:
MAE = (1/N) Σ |y_i − ŷ_i|
MSE = (1/N) Σ (y_i − ŷ_i)²
RMSE = √MSE
R² = 1 − [Σ (y_i − ŷ_i)²] / [Σ (y_i − ȳ)²]
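The four metrics can be computed directly from their definitions; the actual/predicted values below are hypothetical.

```python
# MAE, MSE, RMSE and R-squared computed from their definitions
# in plain Python (hypothetical actual/predicted values).
import math

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
n = len(actual)

mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
mse  = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(mse)

mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot        # close to 1: the fit is good

print(mae, mse, round(rmse, 4), round(r2, 4))
```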
Gradient Descent:
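Gradient descent finds the regression coefficients by repeatedly stepping against the gradient of the MSE cost. A minimal sketch for the line y = a0 + a1·x follows; the data, learning rate and iteration count are hypothetical choices for illustration.

```python
# Minimal gradient descent sketch for y = a0 + a1*x against the MSE cost
# (hypothetical data and learning rate).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # generated from y = 2x + 1
a0, a1 = 0.0, 0.0                # initial coefficients
lr = 0.05                        # learning rate
n = len(xs)

for _ in range(2000):
    preds = [a0 + a1 * x for x in xs]
    # Gradients of the MSE cost with respect to a0 and a1.
    grad_a0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    grad_a1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(round(a0, 2), round(a1, 2))   # approaches the true 1 and 2
```

Too large a learning rate makes the updates diverge; too small a rate makes convergence slow, which is why it is treated as a tuning parameter.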
Model Performance:
The goodness of fit determines how well the line of regression fits the
set of observations. The process of finding the best model out of various
models is called optimization. It can be achieved by the method below:
1. R-squared method:
Below are some important assumptions of Linear Regression. These are some
formal checks to make while building a Linear Regression model, which ensure
that we get the best possible result from the given dataset.
Missing Value Ratio: If a dataset has too many missing values, then we drop
those variables as they do not carry much useful information. To perform this,
we can set a threshold level, and if a variable has missing values more than
that threshold, we will drop that variable. The higher the threshold value,
the more efficient the reduction.
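The missing value ratio check can be sketched as a simple threshold filter; the columns, missing counts and threshold below are hypothetical.

```python
# Missing Value Ratio sketch: drop any column whose share of missing
# values (None) exceeds a threshold (hypothetical data and threshold).

dataset = {
    "age":    [25, None, 31, None, None],   # 60% missing
    "salary": [40, 52, None, 47, 50],       # 20% missing
}
threshold = 0.5

kept = {
    name: col
    for name, col in dataset.items()
    if col.count(None) / len(col) <= threshold
}
print(sorted(kept))   # only columns under the missing-value threshold
```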
Low Variance Filter: As with the missing value ratio technique, data columns
with very little variation in the data carry little information. Therefore,
we calculate the variance of each variable, and all data columns with
variance lower than a given threshold are dropped, because low-variance
features will not affect the target variable.
High Correlation Filter: High correlation refers to the case when two
variables carry approximately the same information. Due to this factor, the
performance of the model can be degraded. The correlation between
independent numerical variables gives the calculated value of the
correlation coefficient. If this value is higher than the threshold value,
we can remove one of the two variables from the dataset. We should, however,
retain variables or features that show a high correlation with the target
variable.
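Both filters reduce to computing a statistic per column and comparing it to a threshold. A plain-Python sketch with hypothetical columns:

```python
# Sketch of the Low Variance and High Correlation filters
# (hypothetical columns; thresholds would be chosen per dataset).
import math

def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]   # x2 = 2*x1: carries the same information
x3 = [5.0, 5.0, 5.0, 5.0]   # zero variance: carries no information

print(variance(x3))                   # 0.0 -> dropped by low variance filter
print(round(correlation(x1, x2), 4))  # 1.0 -> drop one of the pair
```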
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing
a Linear Regression or Logistic Regression model. The steps below are
performed in this technique to reduce the dimensionality or for feature
selection:
o In this technique, firstly, all the n variables of the given dataset are taken
to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1
features for n times, and will compute the performance of the model.
o We will check which variable has made the smallest or no change in
the performance of the model, and then we will drop that variable or
feature; after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.
Forward Feature Selection
o We start with a single feature only, and progressively add one
feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the
performance of the model.
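The elimination loop above can be sketched with a stand-in scoring function; in practice the score would be model performance on held-out data, but here it is a hypothetical lookup chosen so that feature "c" contributes nothing.

```python
# Backward feature elimination sketch with a hypothetical scoring
# function (a real run would retrain a model to score each subset).

def score(features):
    # Hypothetical: "a" and "b" are useful, "c" adds nothing.
    useful = {"a": 0.4, "b": 0.3, "c": 0.0}
    return sum(useful[f] for f in features)

features = ["a", "b", "c"]          # start with all n variables
while len(features) > 1:
    base = score(features)
    # Try removing each feature in turn; find the least damaging removal.
    candidate = min(
        features,
        key=lambda f: base - score([g for g in features if g != f]),
    )
    if base - score([g for g in features if g != candidate]) > 0:
        break                        # every removal hurts: stop
    features.remove(candidate)       # drop the harmless feature

print(sorted(features))              # "c" has been eliminated
```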
Logistic regression
z = mx + c
g(z) = 1 / (1 + e^(−z))
ŷ = g(z)
ŷ = 1 / (1 + e^(−(mx + c)))
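In code, the sigmoid squashes the linear score z = mx + c into a probability in (0, 1); the slope and intercept below are hypothetical coefficients for illustration.

```python
# Logistic regression prediction: sigmoid of a linear score
# (hypothetical coefficients m and c).
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(x, m=2.0, c=-1.0):   # hypothetical coefficients
    return sigmoid(m * x + c)

print(sigmoid(0))                  # 0.5: the decision boundary
print(round(predict_proba(2), 4))  # large positive z -> close to 1
```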
The confusion matrix compares the predicted outcome (Positive or
Negative) with the actual value:
o True Negative: The model has predicted No, and the real or
actual value was also No.
o True Positive: The model has predicted Yes, and the actual
value was also Yes.
o False Negative: The model has predicted No, but the actual
value was Yes; it is also called a Type-II error.
o False Positive: The model has predicted Yes, but the actual
value was No; it is also called a Type-I error.
Precision = TP / (TP + FP)
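The four counts and precision can be computed directly from label lists; the actual/predicted labels below are hypothetical.

```python
# Confusion-matrix counts and precision from hypothetical
# actual/predicted labels ("yes"/"no").

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
tn = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "no")
fp = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "yes")
fn = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")

precision = tp / (tp + fp)         # of all predicted Yes, how many were right
print(tp, tn, fp, fn, round(precision, 3))
```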
In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.
While implementing a Decision tree, the main issue is how to select
the best attribute for the root node and for the sub-nodes. To solve
such problems there is a technique called the Attribute Selection
Measure, or ASM. With this measure, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques
for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
Information Gain = Entropy(S) − [(weighted average) × Entropy(each feature)]
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where:
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
Gini Index = 1 − Σ_j P_j²
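Both impurity measures can be computed for a binary node from the class probability alone; the probabilities below are hypothetical examples.

```python
# Entropy and Gini Index for a binary node, from the probability of
# the "yes" class (hypothetical probabilities).
import math

def entropy(p_yes):
    p_no = 1 - p_yes
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:                      # 0 * log2(0) is taken as 0
            total -= p * math.log2(p)
    return total

def gini(p_yes):
    p_no = 1 - p_yes
    return 1 - (p_yes ** 2 + p_no ** 2)

print(entropy(0.5), gini(0.5))   # maximum impurity: 1.0 and 0.5
print(entropy(1.0), gini(1.0))   # pure node: 0.0 and 0.0
```

A split with high information gain (large drop in entropy) or low Gini Index is preferred when choosing an attribute.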
Ensemble method
An ensemble method is a machine learning technique that combines
several base models in order to produce one optimal predictive
model. To better understand this definition, let's take a step back to
the ultimate goal of machine learning and model building. This will
make more sense as we dive into specific examples of why ensemble
methods are used.
Random Forest
The random forest algorithm works in the following steps:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data
points (subsets).
Step-3: Choose the number N of decision trees that you want to
build.
Step-4: Repeat Steps 1 and 2 until N trees are built.
Step-5: For new data points, find the predictions of each decision
tree, and assign the new data points to the category that wins the
majority of votes.
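The subset selection and the majority vote can be sketched in plain Python; the training indices and the per-tree predictions below are hypothetical (a real run would train actual decision trees on each subset).

```python
# Sketch of random-subset selection and majority voting
# (hypothetical training data and made-up tree predictions).
import random
from collections import Counter

random.seed(0)
training = list(range(10))

# Select K random data points from the training set (here K = 5).
subset = random.sample(training, k=5)

# Each decision tree predicts; the category with the most votes wins.
def majority_vote(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]

votes = ["cat", "dog", "cat", "cat", "dog"]   # N = 5 hypothetical trees
print(majority_vote(votes))                   # "cat" wins 3-2
```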
Clustering
Clustering or cluster analysis is a machine learning technique, which
groups the unlabeled dataset. It can be defined as "A way of
grouping the data points into different clusters, consisting of
similar data points. The objects with the possible similarities
remain in a group that has less or no similarities with another
group."
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is
used to solve the clustering problems in machine learning or data
science. In this topic, we will learn what is K-means clustering
algorithm, how the algorithm works, along with the Python
implementation of k-means clustering.
It is an iterative algorithm that divides the unlabeled dataset into k
different clusters in such a way that each data point belongs to only
one group with similar properties.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids.
Step-3: Assign each data point to its closest centroid, which will
form the predefined K clusters.
Step-4: Calculate a new centroid (the mean) of each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the
new closest centroid of each cluster.
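The loop of assigning points and recomputing centroids can be sketched in one dimension; the data, K = 2 and the initial centroids below are hypothetical, and real use would rely on a library such as scikit-learn.

```python
# Minimal 1-D K-Means sketch (hypothetical data, K = 2).

data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [1.0, 10.0]           # initial centroids

for _ in range(10):               # repeat until assignments stabilise
    # Assign each point to its closest centroid.
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Recompute each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)    # [1.5, 10.5]: one centroid per natural group
```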
References
Analytics Vidhya: Learn machine learning, artificial intelligence, business
analytics, data science, big data, and data visualization tools and
techniques.