Nitin Jha (05114802819)
ON
“MACHINE LEARNING”
Submitted in partial fulfillment of the requirements for
the award of the degree of
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION
ENGINEERING
Submitted By
NITIN JHA
ENROLLMENT NO - (05114802819)
• Spam Detection: Mail services such as Gmail and Hotmail do a lot of hard work for us by classifying mail and moving spam messages to the spam folder. This is achieved by a spam classifier running in the back end of the mail application.
• Database Mining for Automation: Typical applications include mining web-click data for better UX, medical records for better automation in healthcare, biological data, and many more.
• Applications That Cannot Be Programmed Directly: Some tasks cannot be hand-programmed because conventional computers are not modelled that way. Examples include autonomous driving, recognition tasks on unstructured data (face recognition, handwriting recognition), natural language processing, and computer vision.
• Understanding Human Learning: This is the closest we have come to understanding and mimicking the human brain. It is the start of a new revolution, real AI. With this brief insight, let us move to a more formal definition of Machine Learning.
INTRODUCTION
Machine Learning is the science of getting computers to learn without being explicitly programmed. It is closely related to computational statistics, which focuses on making predictions using computers; in its application to business problems, machine learning is also referred to as predictive analytics. Machine Learning focuses on the development of computer programs that can access data and use it to learn for themselves. The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically without human intervention or assistance and to adjust their actions accordingly.
SUPERVISED LEARNING
Supervised Learning is a type of learning in which we are given a data set and already know what the correct output should look like, with the idea that there is a relationship between the input and the output. Basically, it is the task of learning a function that maps an input to an output based on example input-output pairs, inferring the function from labeled training data consisting of a set of training examples. Supervised learning problems are categorized into regression and classification problems.
UNSUPERVISED LEARNING
Unsupervised Learning is a type of learning that allows us to approach problems with little or no idea what our results should look like. We can derive structure by clustering the data based on relationships among the variables in the data. With unsupervised learning there is no feedback based on the prediction results. Basically, it is a type of self-organized learning that helps in finding previously unknown patterns in a data set without pre-existing labels.
REINFORCEMENT LEARNING
Reinforcement learning is a learning method in which an agent interacts with its environment by producing actions and discovers errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. This method allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback is required for the agent to learn which action is best.
SEMI-SUPERVISED LEARNING
Semi-supervised learning falls somewhere between supervised and unsupervised learning, since it uses both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the labeled data requires skilled and relevant resources to acquire and learn from, whereas acquiring unlabeled data generally requires no additional resources.
Python Introduction
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.
Why Python?
LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square brackets.
DICTIONARY: A dictionary is an unordered data structure with elements stored as key: value pairs, separated by commas and enclosed within curly braces {}.
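A quick sketch of the two data structures just described (the sample values are made up for illustration):

```python
# list: ordered, indexable, mutable
scores = [72, 85, 91]
scores.append(60)              # add an element to the end
first = scores[0]              # zero-based indexing -> 72

# dictionary: key -> value pairs inside curly braces
student = {"name": "Nitin", "marks": 85}
student["grade"] = "B"         # add a new key: value pair
marks = student["marks"]       # look up a value by key -> 85
```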
Data Preprocessing, Analysis & Visualization
Machine Learning algorithms don’t work well with raw data. Before we can feed such data to an ML algorithm, we must preprocess it by applying some transformations. With data preprocessing, we convert raw data into a clean data set. Common techniques include -
1. Rescaling Data -
For data with attributes of varying scales, we can rescale attributes to possess the same scale. We rescale
attributes into the range 0 to 1 and call it normalization. We use the Min Max Scaler class from scikit learn.
This gives us values between 0 and 1.
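A minimal pure-Python sketch of what the Min Max Scaler does to a single attribute (the ages are made-up sample values):

```python
# min-max normalization: map each value into the range [0, 1]
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]
print(min_max_scale(ages))  # -> [0.0, 0.25, 0.5, 1.0]
```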
2. Standardizing Data -
With standardizing, we can take attributes with a Gaussian distribution and different means and standard
deviations and transform them into a standard Gaussian distribution with a mean of 0 and a standard
deviation of 1.
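The same idea as a sketch: subtract the mean and divide by the standard deviation, so the result has mean 0 and standard deviation 1 (sample values are made up):

```python
# standardization: transform values to zero mean, unit standard deviation
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    # population standard deviation
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

z = standardize([2, 4, 6, 8])
# z now has mean 0 and standard deviation 1
```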
3. Normalizing Data -
In this task, we rescale each observation to a length of 1 (a unit norm). For this, we use the Normalizer
class.
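A sketch of rescaling one observation to unit length, as the Normalizer class does row by row (the row values are made up):

```python
# scale a single observation so its Euclidean (L2) length is 1
def unit_norm(row):
    length = sum(v ** 2 for v in row) ** 0.5
    return [v / length for v in row]

row = unit_norm([3.0, 4.0])
print(row)  # -> [0.6, 0.8]; the vector's length is now 1
```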
4. Binarizing Data -
Using a binary threshold, we can transform our data by marking values above the threshold 1 and those equal to or below it 0. For this purpose, we use the Binarizer class.
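A sketch of the thresholding rule just stated (the threshold 0.5 and the values are arbitrary choices for illustration):

```python
# mark values strictly above the threshold 1, the rest 0
def binarize(values, threshold=0.5):
    return [1 if v > threshold else 0 for v in values]

print(binarize([0.2, 0.5, 0.9]))  # -> [0, 0, 1]
```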
5. Mean Removal-
We can remove the mean from each feature to center it on zero.
6. Label Encoding -
Labels can be words or numbers. Usually, training data is labelled with words to make it readable. Label encoding converts word labels into numbers so that algorithms can work on them.
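A sketch of label encoding in plain Python: each distinct word label is mapped to an integer (here the labels are sorted first, as scikit-learn's LabelEncoder also does; the colour labels are made up):

```python
# map each distinct label to an integer index
def label_encode(labels):
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]

print(label_encode(["red", "green", "red", "blue"]))  # -> [2, 1, 2, 0]
```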
Variable Treatment
It is the process of identifying whether a variable is
1. Independent or dependent variable
2. Continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of dependent variable.
2. Different data processing techniques for categorical and continuous data.
Categorical variable- Stored as object.
Continuous variable-Stored as int or float.
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.
Missing Value Treatment
Reasons for missing values
1. Non-response - e.g., when you collect data on people’s income and many people choose not to answer.
2. Error in data collection - e.g., faulty data.
3. Error in data reading.
Types
1. MCAR (Missing completely at random): The missing values have no relation either to the variable in which they occur or to the other variables in the dataset.
2. MAR (Missing at random): The missing values have no relation to the variable in which they occur, but are related to other variables in the dataset.
3. MNAR (Missing not at random): The missing values are related to the variable in which they occur.
Identifying
Syntax: -
1. describe()
2. isnull()
isnull() returns True where a value is missing and False otherwise.
Different methods to deal with missing values
1. Imputation
Continuous - impute with the help of the mean, the median, or a regression model.
Categorical - impute with the mode or a classification model.
2. Deletion
Row wise or column wise deletion. But it leads to loss of data.
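The two imputation options above can be sketched in plain Python, with None standing in for a missing value (the sample data is made up):

```python
# fill missing continuous values with the mean of the known values
def impute_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# fill missing categorical values with the mode (most frequent value)
def impute_mode(values):
    known = [v for v in values if v is not None]
    mode = max(set(known), key=known.count)
    return [mode if v is None else v for v in values]

print(impute_mean([10, None, 20]))         # -> [10, 15.0, 20]
print(impute_mode(["a", "b", None, "a"]))  # -> ['a', 'b', 'a', 'a']
```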
Outlier Treatment
Reasons of Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
Types of Outlier
Univariate
Analyzing only one variable for outliers.
e.g. - In a box plot of height and weight, weight will be analyzed for outliers on its own.
Bivariate
Analyzing two variables together for outliers.
e.g. - In a scatter plot of height against weight, both will be analyzed.
Identifying Outlier
Graphical Method
• Box Plot
• Scatter Plot
Formula Method (using the box plot)
A value is an outlier if it is < Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR,
where IQR = Q3 - Q1,
Q3 = value of the 3rd quartile,
Q1 = value of the 1st quartile.
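The IQR rule above can be sketched in plain Python. Note that the quartiles here are computed by simple linear interpolation, so results may differ slightly from other quartile conventions; the data is made up:

```python
# quartile by linear interpolation over a sorted list (one common convention)
def quartile(sorted_vals, q):
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    frac = pos - lo
    if lo + 1 < len(sorted_vals):
        return sorted_vals[lo] * (1 - frac) + sorted_vals[lo + 1] * frac
    return sorted_vals[lo]

# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
def iqr_outliers(values):
    s = sorted(values)
    q1, q3 = quartile(s, 0.25), quartile(s, 0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]
```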
Treating Outlier
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them as a separate group
Variable Transformation
It is the process by which -
1. We replace a variable with some function of that variable, e.g., replacing a variable x with its log.
2. We change the distribution or the relationship of a variable with others.
It is used to -
1. Change the scale of a variable
2. Transform non-linear relationships into linear relationships
3. Create a symmetric distribution from a skewed distribution
Common methods of Variable Transformation – Logarithm, Square root, Cube root, Binning, etc.
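As a small sketch of the logarithm method, a log transform pulls an extreme value far closer to the rest of a right-skewed variable (the incomes are made-up sample values):

```python
import math

incomes = [20_000, 25_000, 30_000, 1_000_000]  # heavily right-skewed
logged = [math.log(v) for v in incomes]

# before: the largest value is 50x the smallest;
# after: the spread on the log scale is only a few units
print(logged)
```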
Model Building
It is the process of creating a mathematical model for estimating/predicting the future based on past data.
e.g. -
A retailer wants to know the default behavior of its credit card customers and predict the probability of default for each customer over the next three months.
• The probability of default would lie between 0 and 1.
• Assume every customer has a 10% default rate, so the starting probability of default for each customer over the next 3 months is 0.1.
• The model moves this probability towards one of the extremes based on attributes from past information: a customer with a volatile income is more likely to default (closer to 1), while a customer with a healthy credit history over the last few years has a low chance of default (closer to 0).
Algorithm Selection
Example -
Is the dependent variable known (is the data labeled)?
• Yes → Supervised Learning
• No → Unsupervised Learning
If supervised: is the dependent variable continuous?
• Yes → Regression
• No → Classification
e.g. - Predicting whether a customer will buy a product or not is a classification problem.
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is the process of learning the relationship/correlation between the independent and dependent variables.
We use the dependent variable of the train data set to teach the model what to predict/estimate.
Dataset
• Train
Past data (known dependent variable).
Used to train model.
• Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model’s rules.
We apply what was learned during training to the test data set for prediction/estimation.
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist.
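The logistic (sigmoid) function at the heart of the model squeezes any linear score into a probability between 0 and 1; a minimal sketch:

```python
import math

# the logistic function: maps a real-valued score to (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # -> 0.5: no evidence either way
print(sigmoid(4))   # close to 1: strong evidence for the class
print(sigmoid(-4))  # close to 0: strong evidence against it
```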
K-MEANS CLUSTERING
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
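The iterative loop just described can be sketched in one dimension: assign each point to the nearest centre, move each centre to the mean of its points, and repeat (the points, starting centres, and K = 2 are made up):

```python
# minimal 1-D k-means: alternate assignment and centre-update steps
def kmeans_1d(points, centres, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

# two clear groups; the centres converge to the group means
print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0]))  # -> [2.0, 11.0]
```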
DECISION TREES
A decision tree falls under supervised Machine Learning algorithms in Python and is used for both classification and regression, although mostly for classification. This model takes an instance, traverses the tree, and compares important features against a determined conditional statement. Whether it descends to the left child branch or the right depends on the result. Usually, the more important features are closer to the root.
Decision Tree, a Machine Learning algorithm in Python can work on both categorical and continuous
dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real numbers) are called regression trees.
KNN ALGORITHM
This is a Python Machine Learning algorithm for classification and regression, though mostly for classification. It is a supervised learning algorithm that stores the training examples and uses a distance function, usually Euclidean distance, to find the points closest to a new case. It then classifies the new case by a majority vote of its k nearest neighbors: the class it assigns is the one most common among those k neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. k-NN is a special case of a variable-bandwidth, kernel density "balloon" estimator with a uniform kernel.
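A minimal one-feature sketch of the majority-vote idea (the training pairs and k = 3 are made up for illustration):

```python
# classify a query by majority vote among its k nearest training points
def knn_predict(train, query, k=3):
    # train is a list of (feature, label) pairs
    nearest = sorted(train, key=lambda t: abs(t[0] - query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
print(knn_predict(train, 1.2))  # -> 'a'
print(knn_predict(train, 8.5))  # -> 'b'
```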
RANDOM FOREST
A random forest is an ensemble of decision trees. In order to classify a new object based on its attributes, the trees vote for a class: each tree provides a classification, and the classification with the most votes wins in the forest.
Random forests or random decision forests are an ensemble learning method for classification, regression and
other tasks that operates by constructing a multitude of decision trees at training time and outputting the class
that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
Problem Description