
INDUSTRIAL TRAINING REPORT

ON
“MACHINE LEARNING”
Submitted in partial fulfillment of the requirements for
the award of the degree of

BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION
ENGINEERING

Submitted By
NITIN JHA
ENROLLMENT NO - (05114802819)

Department of Electronics and Communication Engineering

Maharaja Agrasen Institute of Technology & Management

Guru Gobind Singh Indraprastha University
Dwarka, New Delhi-110078
ACKNOWLEDGEMENT
I would like to acknowledge the contributions of the following people, without whose help and guidance this
report would not have been completed.
I acknowledge, with respect and gratitude, the counsel and support of our training coordinator, Mr. PRAVEEN KUMAR, Assistant
Professor, ECE Department, whose expertise, guidance, support, encouragement,
and enthusiasm have made this report possible. His feedback vastly improved the quality of this report and
provided an enthralling experience. I am indeed proud and fortunate to be supported by him.
I am also thankful to Prof. (Dr.) Sunil Mathur, H.O.D. of the Electronics and Communication Engineering
Department, Maharaja Agrasen Institute of Technology, for his constant encouragement, valuable
suggestions, moral support and blessings.
Although it is not possible to name everyone individually, I shall ever remain indebted to the faculty members of
Maharaja Agrasen Institute of Technology, Rohini Sector-22, Delhi, for their persistent support and
cooperation extended during this work.
This acknowledgement would remain incomplete if I failed to express my deep sense of obligation to my parents
and God for their consistent blessings and encouragement.
OBJECTIVES
1.) Introduction to machine learning
2.) Data
3.) Introduction to Python
4.) Data exploration and Preprocessing
5.) Linear Regression
6.) Logistic Regression
7.) Decision Trees
8.) Ensemble Models
9.) Clustering

APPLICATIONS OF MACHINE LEARNING


• Web Search Engine: One of the reasons why search engines like Google, Bing, etc. work so well
is because the system has learned how to rank pages through a complex learning algorithm.
• Photo Tagging Applications: Be it Facebook or any other photo tagging application, the ability to
tag friends makes it even more engaging. It is all possible because of a face recognition algorithm
that runs behind the application.

• Spam Detector: Mail services such as Gmail or Hotmail do a lot of hard work for us by classifying
mail and moving spam messages to the spam folder. This is achieved by a spam classifier
running in the back end of the mail application.

• Database Mining for growth of automation: Typical applications include mining web-click data for better
UX, medical records for better automation in healthcare, biological data, and many more.

• Applications that cannot be programmed directly: Some tasks cannot be hand-programmed because the
computers we use are not modelled that way. Examples include autonomous driving, recognition
tasks on unordered data (face recognition/handwriting recognition), Natural Language
Processing, Computer Vision, etc.

• Understanding Human Learning: This is the closest we have come to understanding and mimicking the human
brain. It is the start of a new revolution, real AI. Now, after this brief insight, let us come to a more
formal definition of machine learning.

INTRODUCTION
Machine learning is the science of getting computers to learn without being explicitly programmed. It is closely
related to computational statistics, which focuses on making predictions using computers; in its application across
business problems, machine learning is also referred to as predictive analytics. Machine learning focuses on the
development of computer programs that can access data and use it to learn for themselves. The process of learning
begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns
in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow
computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.

TYPES OF MACHINE LEARNING


The types of machine learning algorithms differ in their approach, the type of data they input and output, and
the type of task or problem that they are intended to solve. Broadly Machine Learning can be categorized into
four categories.
I. Supervised Learning
II. Unsupervised Learning
III. Reinforcement Learning
IV. Semi-supervised Learning

SUPERVISED LEARNING
Supervised learning is a type of learning in which we are given a data set and we already know what our correct
output should look like, having the idea that there is a relationship between the input and the output. Basically, it is
the task of learning a function that maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples. Supervised learning problems are
categorized into regression and classification problems.

UNSUPERVISED LEARNING
Unsupervised learning is a type of learning that allows us to approach problems with little or no idea of what our
results should look like. We can derive structure by clustering the data based on relationships among the
variables in the data. With unsupervised learning there is no feedback based on the prediction results. Basically, it is a
type of self-organized learning that helps in finding previously unknown patterns in a data set without pre-
existing labels.
REINFORCEMENT LEARNING

Reinforcement learning is a learning method that interacts with its environment by producing actions and
discovers errors or rewards. Trial and error search and delayed reward are the most relevant characteristics of
reinforcement learning. This method allows machines and software agents to automatically determine the ideal
behavior within a specific context in order to maximize its performance. Simple reward feedback is required
for the agent to learn which action is best.

SEMI-SUPERVISED LEARNING

Semi-supervised learning falls somewhere in between supervised and unsupervised learning, since it uses both
labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of
unlabeled data. Systems that use this method are able to considerably improve learning accuracy. Usually,
semi-supervised learning is chosen when labeling the acquired data requires skilled and relevant resources,
whereas acquiring unlabeled data generally doesn't require additional resources.

Applications of Machine Learning


Machine learning is one of the most exciting technologies one could come across. As is evident
from the name, it gives the computer the ability that makes it more similar to humans: the ability to learn.
Machine learning is actively being used today, perhaps in many more places than one would expect; we probably
use a learning algorithm dozens of times a day without even knowing it. Typical applications include web search
ranking, photo tagging, and spam detection, as described in the application list earlier in this report.

Python Introduction
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.

Python for Data science:

Why Python?

1. Python is an open-source language.
2. Syntax as simple as English.
3. Very large and collaborative developer community.
4. Extensive packages.
• UNDERSTANDING OPERATORS:
Operators are symbolic representations of mathematical operations.
• VARIABLES AND DATATYPES:
Variables are names bound to objects. Basic data types in Python are int (integer), float, Boolean, and
string.
• CONDITIONAL STATEMENTS:
If-else statements (single condition)
If-elif-else statements (multiple conditions)
• LOOPING CONSTRUCTS:
For loop
• FUNCTIONS:
Functions are reusable pieces of code, created for solving a specific problem, and can be called as many
times as needed.
Two types: built-in functions and user-defined functions.
• DATA STRUCTURES:

Two types of Data structures:

LISTS: A list is an ordered data structure with elements separated by comma and enclosed within
square brackets.

DICTIONARY: A dictionary is an unordered data structure with elements separated by comma and
stored as key: value pair, enclosed with curly braces {}.
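The constructs listed above can be combined in one short illustrative script (all names here are made up for the example):

```python
# Variables, a conditional, a loop, a function, a list and a dictionary.

count = 5                      # int variable
price = 19.99                  # float variable

# Conditional statement (single condition)
if count > 3:
    status = "many"
else:
    status = "few"

# Looping construct: sum 1 + 2 + 3 + 4 + 5
total = 0
for i in range(1, count + 1):
    total += i

# User-defined function: a reusable piece of code
def square(x):
    return x * x

# List: ordered elements, comma-separated, in square brackets
marks = [78, 85, 92]

# Dictionary: key: value pairs enclosed in curly braces
student = {"name": "Nitin", "marks": marks}

print(status, total, square(4), student["name"])
```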
Data Preprocessing, Analysis & Visualization
Machine learning algorithms don't work well with raw data. Before we can feed such data
to an ML algorithm, we must preprocess it by applying some transformations. With data
preprocessing, we convert raw data into a clean data set. To do this, there are seven common techniques -

1. Rescaling Data -
For data with attributes of varying scales, we can rescale the attributes so that they possess the same scale.
Rescaling attributes into the range 0 to 1 is called normalization. We use the MinMaxScaler class from
scikit-learn, which gives us values between 0 and 1.
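A minimal rescaling sketch with scikit-learn's MinMaxScaler (illustrative values, not from any real dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])   # attribute on an arbitrary scale
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)        # min maps to 0, max maps to 1
print(X_scaled.ravel())
```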

2. Standardizing Data -
With standardizing, we can take attributes with a Gaussian distribution and different means and standard
deviations and transform them into a standard Gaussian distribution with a mean of 0 and a standard
deviation of 1.
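The same idea with scikit-learn's StandardScaler (illustrative values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])       # mean 2, non-unit spread
X_std = StandardScaler().fit_transform(X) # shift to mean 0, scale to std 1
print(X_std.mean(), X_std.std())
```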

3. Normalizing Data -
In this task, we rescale each observation to a length of 1 (a unit norm). For this, we use the Normalizer
class.
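A one-observation sketch of the Normalizer class (values chosen so the unit-norm result is easy to check by hand):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0]])               # one observation, Euclidean length 5
X_norm = Normalizer().fit_transform(X)   # rescale the row to unit (L2) length
```

The row becomes [0.6, 0.8], whose Euclidean length is 1.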

4. Binarizing Data -
Using a binary threshold, it is possible to transform our data by marking the values above it 1 and those
equal to or below it 0. For this purpose, we use the Binarizer class.
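A small Binarizer sketch (illustrative values and threshold):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.2, 0.7, 0.5]])
# Values strictly above the threshold become 1; values <= 0.5 become 0
X_bin = Binarizer(threshold=0.5).fit_transform(X)
```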

5. Mean Removal-
We can remove the mean from each feature to center it on zero.

6. One Hot Encoding -


When a feature takes only a few scattered values, storing them as plain numbers can suggest an
ordering that does not exist. Instead, we can perform one-hot encoding: for k distinct values, we
transform the feature into a k-dimensional vector with a single value of 1 and 0 for the rest.
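A one-hot encoding sketch with scikit-learn (an illustrative colour feature with k = 2 distinct values):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["green"], ["red"]]
enc = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it readable.
# Categories are sorted, so "green" -> [1, 0] and "red" -> [0, 1].
X_hot = enc.fit_transform(X).toarray()
```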

7. Label Encoding -
Some labels can be words or numbers. Usually, training data is labelled with words to make it readable.
Label encoding converts word labels into numbers so that algorithms can work on them.
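A label encoding sketch with scikit-learn's LabelEncoder (illustrative labels):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["spam", "ham", "spam", "ham"]
le = LabelEncoder()
encoded = le.fit_transform(labels)   # classes are sorted: "ham" -> 0, "spam" -> 1
```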
Variable Treatment
It is the process of identifying whether a variable is
1. Independent or dependent variable
2. Continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of dependent variable.
2. Different data processing techniques for categorical and continuous data.
Categorical variables are stored as object.
Continuous variables are stored as int or float.
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.
Missing Value Treatment
Reasons for missing values
1. Non-response – e.g., when you collect data on people's income and many choose not to answer.
2. Error in data collection, e.g. faulty data.
3. Error in data reading.
Types
1. MCAR (Missing completely at random): The missingness has no relation either to the variable in which the
missing values exist or to the other variables in the dataset.
2. MAR (Missing at random): The missingness has no relation to the variable in which the missing values exist,
but may be related to the other variables in the dataset.
3. MNAR (Missing not at random): The missingness is related to the variable in which the missing values
exist.
Identifying
Syntax:
1. describe()
2. isnull()
Output will be True or False.
Different methods to deal with missing values
1. Imputation
Continuous – impute with the help of the mean, the median, or a regression model.
Categorical – impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion, but it leads to loss of data.
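The identification and imputation steps above can be sketched with pandas (a tiny illustrative frame with one missing continuous value and one missing categorical value):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [100.0, np.nan, 300.0],
                   "city": ["Delhi", "Delhi", None]})

print(df.isnull().sum())   # count of missing values per column

# Imputation: continuous column with the mean, categorical column with the mode
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```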
Outlier Treatment
Reasons of Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
Types of Outlier

Univariate
Analyzing only one variable for outliers.
e.g. – In a box plot of height and weight, weight will be analyzed for outliers.

Bivariate
Analyzing both variables for outliers.
e.g. – In a scatter plot of height and weight, both will be analyzed.
Identifying Outlier

Graphical Method
• Box Plot

• Scatter Plot

Formula Method
Using Box Plot
< Q1 - 1.5 * IQR or > Q3+1.5 * IQR
Where IQR= Q3 – Q1
Q3=Value of 3rd quartile
Q1=Value of 1st quartile
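The box-plot formula above can be applied directly with NumPy (illustrative data containing one obvious outlier):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])     # 95 is a suspicious value
q1, q3 = np.percentile(data, [25, 75])        # 1st and 3rd quartiles
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```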
Treating Outlier
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treat them as separate
Variable Transformation
Is the process by which-
1. We replace a variable with some function of that variable. e.g. – Replacing a variable x with its log.
2. We change the distribution or relationship of a variable with others.
Used to –
1. Change the scale of a variable
2. Transforming non-linear relationships into linear relationship
3. Creating symmetric distribution from skewed distribution.
Common methods of Variable Transformation – Logarithm, Square root, Cube root, Binning, etc.
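A logarithm transformation can be sketched in one line with NumPy (illustrative right-skewed values):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # right-skewed: values span 3 orders of magnitude
x_log = np.log10(x)                        # log transform compresses the scale
# after the transform, consecutive values are evenly spaced: 0, 1, 2, 3
```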
Model Building

It is a process to create a mathematical model for estimating / predicting the future based on past data.
e.g.-
A retailer wants to know the default behaviour of its credit card customers. They want to predict the
probability of default for each customer over the next three months.
• Probability of default would lie between 0 and 1.
• Assume every customer has a 10% default rate, so the probability of default for each customer in the
next 3 months = 0.1.
The model moves the probability towards one of the extremes based on attributes from past information.
A customer with a volatile income is more likely to default (probability closer to 1).
A customer with a healthy credit history in recent years has a low chance of default (probability closer to 0).

Steps in Model Building


1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

Algorithm Selection
Example-

Have a dependent variable?
  Yes -> Supervised Learning
  No  -> Unsupervised Learning

Is the dependent variable continuous?
  Yes -> Regression
  No  -> Classification

e.g.- Predict whether the customer will buy the product or not (classification).
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship/correlation between the independent and dependent variables.
We use the dependent variable of the train data set during training, so that the model can later predict/estimate it.

Dataset
• Train
Past data (known dependent variable).
Used to train model.
• Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model's rules.
We apply what the model learned during training to the test data set for prediction/estimation.
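The three steps (algorithm selection, training, scoring) can be sketched with scikit-learn on synthetic data (all names and values are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [2], [3], [4], [5], [6], [7]]   # independent variable
y = [0, 0, 0, 0, 1, 1, 1, 1]                   # dependent variable (known)

# Hold out 25% of the rows to play the role of the test data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # training: learn the relationship
accuracy = model.score(X_test, y_test)              # scoring: predict on unseen rows
```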

Algorithm of Machine Learning


Linear Regression
Linear regression is a statistical approach for modelling the relationship between a dependent variable and a
given set of independent variables.
It is assumed that the two variables are linearly related; hence, we try to find a linear function that predicts
the response value (y) as accurately as possible as a function of the feature or independent variable (x).

[Figure: scatter plot of Y-Values (0 to 14) against x (0 to 9) with the fitted regression line]

The equation of the regression line is represented as:

    y = b0 + b1 * x

where b0 is the intercept and b1 is the slope. The squared error or cost function J is:

    J(b0, b1) = (1 / 2m) * Σ (h(xi) - yi)^2

where h(xi) = b0 + b1 * xi is the predicted response for the i-th example and m is the number of training examples.
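A minimal fitting sketch with scikit-learn, using synthetic points chosen to lie exactly on the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression().fit(X, y)          # least-squares fit minimizes the cost J
b1, b0 = model.coef_[0], model.intercept_     # slope and intercept
pred = model.predict([[5]])                   # predicted response for x = 5
```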
Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a
binary dependent variable, although many more complex extensions exist.

C = -( y log(ŷ) + (1 - y) log(1 - ŷ) )

where y is the true label and ŷ is the probability predicted by the logistic function ŷ = 1 / (1 + e^(-z)).
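The logistic function and the cross-entropy cost can be written directly with NumPy (a small illustrative sketch):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cost(y_true, y_hat):
    # Cross-entropy cost for a single example: small when the predicted
    # probability y_hat agrees with the true label y_true
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
```

A confident correct prediction (y_hat near 1 when y_true = 1) gives a cost near 0, while a confident wrong prediction is penalized heavily.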

K-Means Clustering (Unsupervised learning)

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data,
with the number of groups represented by the variable K. The algorithm works iteratively to assign
each data point to one of K groups based on the features that are provided. Data points are clustered
based on feature similarity.
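A minimal K-means sketch with scikit-learn (synthetic 2-D points forming two obvious groups):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated clumps of points
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_   # each point is assigned to one of K = 2 groups
```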

DECISION TREES
A decision tree falls under supervised machine learning algorithms in Python and is used for both
classification and regression, although mostly for classification. This model takes an instance, traverses the
tree, and compares important features against determined conditional statements. Whether it descends to the left
child branch or the right depends on the result. Usually, the more important features are closer to the root.

Decision Tree, a Machine Learning algorithm in Python can work on both categorical and continuous
dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real numbers) are called regression trees.
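A tiny classification-tree sketch with scikit-learn (synthetic data in which the class depends only on the first feature):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]        # class equals the first feature; the second is irrelevant

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict([[1, 0]])   # the tree splits on the informative feature
```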

KNN ALGORITHM

This is a Python machine learning algorithm for classification and regression, though mostly for classification.
It is a supervised learning algorithm that stores the training instances and uses a distance function (usually
Euclidean) to compare a new point with them. It classifies new cases using a majority vote of its k nearest
neighbours: the class it assigns is the one most common among the k training points closest to the new case.
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all
computation is deferred until classification. k-NN is a special case of a variable-bandwidth, kernel density
"balloon" estimator with a uniform kernel.
RANDOM FOREST

A random forest is an ensemble of decision trees. In order to classify every new object based on its attributes,
the trees vote for a class: each tree provides a classification, and the classification with the most votes wins in the forest.
Random forests or random decision forests are an ensemble learning method for classification, regression and
other tasks that operates by constructing a multitude of decision trees at training time and outputting the class
that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
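A random forest sketch with scikit-learn (the same synthetic pattern as the decision-tree example, repeated so the bootstrapped trees have data to vote on):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5   # class depends only on the first feature
y = [0, 0, 1, 1] * 5

rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = rf.predict([[1, 1]])   # each tree votes; the majority class wins
```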
Problem Description

We are provided with the following files: train.csv and test.csv.


Use the train.csv dataset to train the model. This file contains all the client and call details as well as the target
variable "subscribed". Then use the trained model to predict whether a new set of clients will subscribe to the
term deposit.
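A hedged sketch of this workflow. The file names train.csv/test.csv and the target column "subscribed" come from the problem statement, but the feature columns ("age", "balance") are hypothetical placeholders, since the actual columns are not listed; tiny in-memory frames stand in for the CSV files:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for pd.read_csv("train.csv") and pd.read_csv("test.csv");
# "age" and "balance" are invented example features.
train = pd.DataFrame({"age": [25, 40, 35, 50],
                      "balance": [100, 900, 300, 1200],
                      "subscribed": [0, 1, 0, 1]})
test = pd.DataFrame({"age": [30, 45],
                     "balance": [200, 1000]})

X_train = train.drop(columns="subscribed")
model = RandomForestClassifier(random_state=0).fit(X_train, train["subscribed"])

# Predict the target for the new clients
test["subscribed"] = model.predict(test[X_train.columns])
```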
