Lecture 1. Introduction To Machine Learning
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way humans learn, gradually improving its accuracy. It allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so: machine learning algorithms use historical data as input to predict new output values.
Over the last couple of decades, the technological advances in storage and processing power have enabled
some innovative products based on machine learning, such as Netflix’s recommendation engine and self-
driving cars.
1. Supervised learning
Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, the model adjusts its weights until it has been fitted appropriately. This occurs as part of the cross-validation process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam into a separate folder from your inbox. Some methods used in supervised learning include neural networks, naïve Bayes, linear regression, logistic regression, random forest, and support vector machines (SVM).
2. Unsupervised learning
Unsupervised learning, also known as unsupervised machine learning, uses machine-learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. This method's ability to discover similarities and differences in information makes it ideal for exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition. It is also used to reduce the number of features in a model through the process of dimensionality reduction. Principal component analysis (PCA) and singular value decomposition (SVD) are two common approaches for this. Other algorithms used in unsupervised learning include neural networks, k-means clustering, and probabilistic clustering methods.
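As a rough illustration (not part of the original text), the following Python sketch, assuming scikit-learn is available and using invented synthetic data, clusters unlabeled points with k-means and reduces the feature count with PCA:

# A minimal sketch of unsupervised learning: k-means groups unlabeled points,
# and PCA performs dimensionality reduction. The data here is synthetic.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)  # unlabeled data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)    # discovered groupings
X_2d = PCA(n_components=2).fit_transform(X)                                # 5 features reduced to 2
print(labels[:10], X_2d.shape)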
3. Semi-supervised learning
Semi-supervised learning offers a happy medium between supervised and unsupervised learning. During
training, it uses a smaller labeled data set to guide classification and feature extraction from a larger,
unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data for a
supervised learning algorithm. It also helps if it is too costly to label enough data.
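A hedged sketch of this idea, assuming scikit-learn is available (its SelfTrainingClassifier marks unlabeled samples with -1), might look like this:

# Semi-supervised learning: only the first 50 samples keep their labels;
# the rest are marked -1 (unlabeled) and the classifier propagates labels to them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=200, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1
model = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
print((model.predict(X) == y).mean())   # accuracy against the full label set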
Although the terms Data Science vs. Machine Learning vs. Artificial Intelligence might be related
and interconnected, each is unique and is used for different purposes. Data Science is a broad term, and
Machine Learning falls within it. Here are the key differences between the three terms.
Artificial Intelligence combines large amounts of data through iterative processing and intelligent algorithms to help computers learn automatically. Some of the popular tools that AI uses are TensorFlow, Scikit-Learn, and Keras. Artificial Intelligence uses logic and decision trees. Chatbots and voice assistants are popular applications of AI.
Machine Learning uses efficient programs that can use data without being explicitly told to do so. The popular tools that Machine Learning makes use of are Amazon Lex, IBM Watson Studio, and Microsoft Azure ML Studio. Machine Learning uses statistical models. Recommendation systems such as Spotify's, as well as facial recognition, are popular examples.
Data Science works by sourcing, cleaning, and processing data to extract meaning out of it for analytical purposes. Some of the popular tools used by Data Science are SAS, Tableau, Apache Spark, and MATLAB. Data Science deals with structured and unstructured data. Fraud detection and healthcare analysis are popular examples of Data Science.
Following best practices when building machine-learning models is a time-consuming yet important process. There is much to do, ranging from preparing the data, selecting and training algorithms, and understanding how the algorithm makes decisions, all the way down to deploying models to production. I like to think of the machine learning design and maintenance process as comprising ten steps (listed below). However, if I want to save time, increase accuracy, and reduce risk, I do not manually go through the entire machine learning process in order to build my machine learning models. Instead, I turn to automated machine learning, using clever software that knows how to automate the repetitive and mundane steps, freeing me up to do what humans are best at: communicating, applying common sense, and being creative. In addition, to get the most out of automated machine learning, I want it to automate each and every one of the ten steps. Therefore, here is my guide to what to look for in an automated machine learning system.
1. Preprocessing of Data
2. Feature Engineering
3. Diverse Algorithms
4. Algorithm Selection
5. Training and Tuning
6. Ensembles
7. Head-to-Head Model Competitions
8. Human-Friendly Insights
9. Easy Deployment
10. Model Monitoring and Management
Supervised Learning
Supervised learning uses a training set to teach models to yield the desired output. This training dataset
includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its
accuracy through the loss function, adjusting until the error has been sufficiently minimized. Supervised
learning can be separated into two types of problems when data mining: classification and regression (a toy sketch of this loss-minimization idea follows the list below).
1. Classification uses an algorithm to accurately assign test data into specific categories. It recognizes
specific entities within the dataset and attempts to draw some conclusions on how those entities
should be labeled or defined. Common classification algorithms are linear classifiers, support vector
machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in
more detail below.
2. Regression is used to understand the relationship between dependent and independent variables. It
is commonly used to make projections, such as for sales revenue for a given business. Linear
regression, logistic regression, and polynomial regression are popular regression algorithms.
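Here is the toy illustration referenced above (not from the original text): a model adjusting a single weight with gradient descent until the squared-error loss is sufficiently minimized.

# Labeled training pairs: inputs and correct outputs (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.2, 5.9, 8.1]
w = 0.0                                 # the model's single weight
for _ in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.01 * grad                    # adjust the weight to reduce the error
print(round(w, 3))                      # converges near 2.0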
1. Regression
Linear regression
It is used to identify the relationship between two variables, typically used for making future predictions.
Moreover, linear regression is subdivided based on the number of independent variables. For instance, if there is one independent and one dependent variable, it is known as simple linear regression. Meanwhile, if there are two or more independent variables for a single dependent variable, it is called multiple linear regression.
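A brief sketch of both variants, assuming scikit-learn is available and using made-up numbers:

from sklearn.linear_model import LinearRegression
import numpy as np

x_simple = np.array([[1], [2], [3], [4]])             # one independent variable
x_multi = np.array([[1, 5], [2, 3], [3, 8], [4, 1]])  # two independent variables
y = np.array([3.0, 5.1, 7.2, 8.9])                    # one dependent variable

print(LinearRegression().fit(x_simple, y).coef_)  # simple: one coefficient
print(LinearRegression().fit(x_multi, y).coef_)   # multiple: one coefficient per feature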
Logistic regression
Logistic regression is used when the dependent variable is categorical or has binary outputs like ‘yes’ or
‘no’. Moreover, logistic regression is used to solve binary classification problems; that is why it predicts
discrete values for variables.
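For example, a hedged scikit-learn sketch (with invented "hours studied vs. passed" data) shows the discrete yes/no outputs:

from sklearn.linear_model import LogisticRegression
import numpy as np

hours = np.array([[1], [2], [3], [4], [5], [6]])   # hours studied
passed = np.array([0, 0, 0, 1, 1, 1])              # binary outcome: fail/pass
clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[2.5], [4.5]]))      # discrete predictions, likely [0 1]
print(clf.predict_proba([[4.5]]))       # the underlying class probabilities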
2. Naive Bayes
A Naive Bayes algorithm is used for large datasets. The approach works on the fundamental assumption that every feature in the dataset contributes independently. This means that the presence of one feature will not impact another. Generally, it is used in text classification, recommendation systems, and other applications.
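A minimal text-classification sketch, assuming scikit-learn is available and using a tiny invented spam/ham corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at noon today",
         "claim your free prize", "lunch meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words features feed a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
print(clf.predict(["free prize today"]))   # most likely ['spam']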
There are different types of Naive Bayes models, such as the Gaussian, multinomial, and Bernoulli variants. Another supervised learning algorithm that remains highly popular among business organizations is the decision tree, which structurally resembles a flowchart. However, a decision tree and a flowchart fundamentally perform different roles and responsibilities.
A decision tree consists of control statements containing decisions and their consequences. The output in a
decision tree relates to the labelling of unforeseen data. ID3 and CART are some of the popular decision
tree algorithms widely used across various industries.
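A hedged sketch of a decision tree, assuming scikit-learn is available (its implementation is an optimized CART variant; ID3 itself is not included):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # the flowchart-like structure of control statements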
3. Classification
It is a type of supervised learning algorithm that accurately assigns data into different categories or classes.
It recognizes specific entities and analyses them to conclude where those entities must be categorized.
Some of the classification algorithms are as follows:
K-nearest neighbor
Random forest
Support vector machines
Decision tree
Linear classifiers
4. Neural networks
This type of supervised learning algorithm is used to group or categorize raw data. In addition, it is used for finding patterns or interpreting sensory data. However, the algorithm requires large amounts of computational resources. As a result, its uses are constrained.
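As an illustrative sketch (assuming scikit-learn is available), a small multi-layer perceptron can categorize raw pixel data:

from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # raw 8x8 pixel intensities
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                    random_state=0).fit(X, y)
print(net.score(X, y))   # training accuracy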
5. Random forest
A random forest algorithm is often called an ensemble method because it combines different supervised learning models to reach a conclusion. It builds many decision trees and combines their individual outputs into a single classification. As a result, it is widely used across industries.
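A minimal sketch with scikit-learn (assumed available): 100 trees each cast a vote, and the majority class becomes the forest's output.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))   # the majority vote of the individual trees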
Below are some assumptions made when using a decision tree:
At the beginning, we consider the complete training set as the root.
Feature values are preferred to be categorical; if the values are continuous, they are discretized prior to building the model.
Records are distributed recursively based on attribute values.
We use statistical methods to order attributes as the root or as internal nodes.
A decision tree works on the sum-of-products form, which is also known as Disjunctive Normal Form; for example, a tree might predict whether people use a computer in their daily life. In a decision tree, the major challenge is the identification of the attribute for the root node at each level. This process is known as attribute selection, sketched below.
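One common attribute-selection criterion is information gain, used by ID3. A small self-contained sketch (with invented labels, not from the original text):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_groups):
    # Entropy reduction achieved by splitting `labels` into `split_groups`.
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in split_groups)
    return entropy(labels) - remainder

labels = ["yes", "yes", "yes", "no", "no", "no"]
# A hypothetical attribute that separates the classes perfectly:
print(information_gain(labels, [["yes", "yes", "yes"], ["no", "no", "no"]]))  # 1.0

The attribute with the highest information gain is chosen as the node at each level.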
Bagging in ensemble machine learning takes several weak models and aggregates their predictions to produce the best prediction. The weak models specialize in distinct sections of the feature space, which enables bagging to take advantage of predictions coming from every model.
What is Bootstrapping?
Bagging is composed of two parts: aggregation and bootstrapping. Bootstrapping is a sampling method,
where a sample is chosen out of a set, using the replacement method. The learning algorithm is then run
on the samples selected. The bootstrapping technique uses sampling with replacement to make the selection procedure completely random (a toy comparison follows below). When a sample is selected without replacement, the subsequent selections of variables are always dependent on the previous selections, making the criteria non-random.
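A tiny Python comparison of the two sampling schemes, using only the standard library:

import random

population = list(range(10))
# With replacement: each draw is independent, so items can repeat (bootstrapping).
bootstrap_sample = random.choices(population, k=10)
# Without replacement: later draws depend on earlier ones, so no repeats.
plain_sample = random.sample(population, k=10)
print(bootstrap_sample)
print(plain_sample)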
What is Aggregation?
Model predictions undergo aggregation to combine them for the final prediction to consider all the
possible outcomes. The aggregation can be done based on the total number of outcomes or the probability
of predictions derived from the bootstrapping of every model in the procedure.
Random forest is an ensemble method – a technique where we take many base-level models and combine
them to get improved results. Therefore, to understand how it operates, we first need to look at its components – decision trees – and how they work.
Some of the most common types of this algorithm are classification and regression trees.
Classification trees
The technique is used to determine which “class” a target variable is most likely to belong to. Thus, you can
determine who will or will not subscribe to a streaming platform or who will drop out of college, and who
will finish their studies successfully.
In unsupervised learning, we aim to extend the characteristics of certain data points to their neighbors by
assuming that the similarities between them are not limited to some specific features only. For example, in
a recommendation system, users can be grouped based on their interests in certain movies. If the chosen criteria detect analogies between two users, we can share the non-overlapping elements between them.
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure
is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work: in hierarchical clustering there is no requirement to predetermine the number of clusters, as we did in the K-means algorithm.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all
data points as single clusters and merging them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach. A minimal agglomerative sketch follows this list.
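A hedged sketch of the agglomerative (bottom-up) approach, assuming SciPy is available; the linkage matrix encodes the dendrogram tree, which is then cut into two flat clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.8], [0.5, 1.0]])
Z = linkage(X, method="ward")                      # bottom-up merges of single points
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)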
1. Bagging
Bagging, the short form for bootstrap aggregating, is mainly applied in classification and regression. It increases the accuracy of models through decision trees, reducing variance to a large extent. The reduction of variance increases accuracy and curbs overfitting, which is a challenge to many predictive models. Bagging consists of two steps: bootstrapping and aggregation. Bootstrapping is a sampling technique where samples are derived from the whole population (set) using the replacement procedure. Sampling with replacement helps make the selection procedure randomized. The base learning algorithm is then run on the samples to complete the procedure.
Aggregation in bagging is done to incorporate all possible outcomes of the prediction and randomize the outcome. Without aggregation, predictions will not be accurate, because not all outcomes are taken into consideration. Therefore, the aggregation is based either on the probabilities derived from the bootstrapping procedures or on all outcomes of the predictive models.
Bagging is advantageous since weak base learners are combined to form a single strong learner that is more stable than the individual learners. It also reduces variance, thereby limiting the overfitting of models. One limitation of bagging is that it is computationally expensive; it can also introduce more bias into models when the proper bagging procedure is ignored. A minimal sketch follows.
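This hedged scikit-learn sketch (on synthetic data) bags 50 decision trees, each trained on its own bootstrap sample, and aggregates their votes:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))   # accuracy of the aggregated ensemble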
2. Boosting
Boosting is an ensemble technique that learns from previous predictor mistakes to make better predictions
in the future. The technique combines several weak base learners to form one strong learner, thus
significantly improving the predictability of models. Boosting works by arranging weak learners in a sequence, such that each weak learner learns from the mistakes of the learner before it, creating better predictive models.
Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost
(Extreme Gradient Boosting). AdaBoost uses weak learners in the form of decision trees, which mostly
include one split, popularly known as decision stumps. In AdaBoost's first decision stump, all observations carry similar weights; later stumps give more weight to the observations that were previously misclassified.
Gradient boosting adds predictors sequentially to the ensemble, where each new predictor corrects the errors of its predecessors, thereby increasing the model's accuracy. New predictors are fit to counter the effects of errors in the previous predictors. Gradient descent helps the gradient booster identify problems in learners' predictions and counter them accordingly.
XGBoost makes use of gradient-boosted decision trees and is engineered for computational speed and model performance. Because model training must follow a sequence, ordinary gradient boosted machines are slow to train; XGBoost is designed to mitigate this. A sketch of AdaBoost and gradient boosting follows.
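A hedged sketch of both techniques with scikit-learn (assumed available), on synthetic data; the AdaBoost base learner is a depth-1 tree, i.e., a decision stump:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
# AdaBoost: a sequence of decision stumps, re-weighting misclassified samples.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0).fit(X, y)
# Gradient boosting: each new tree fits the residual errors of its predecessors.
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
print(ada.score(X, y), gbm.score(X, y))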
3. Stacking
Stacking, another ensemble method, is often referred to as stacked generalization. This technique works by training a meta-algorithm to combine the predictions of several other learning algorithms. Stacking has been successfully implemented in regression, density estimation, distance learning, and classification. It can also be used to measure the error rate involved during bagging.
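A brief sketch, assuming scikit-learn is available: a logistic-regression meta-learner is trained on the predictions of a random forest and an SVM.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svm", SVC())],
    final_estimator=LogisticRegression()).fit(X, y)
print(stack.score(X, y))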
Gradient Boosting
Gradient boosting is a technique used in creating models for prediction. The technique is mostly used in
regression and classification procedures. Prediction models are often presented as decision trees for
choosing the best prediction. Gradient boosting presents model building in stages, just like other boosting
methods, while allowing the generalization and optimization of differentiable loss functions. The concept of gradient boosting originated with the American statistician Leo Breiman, who observed that the technique could be applied to appropriate cost functions as an optimization algorithm. The method has undergone further development to optimize cost functions by iteratively picking a weak hypothesis, or a function with a negative gradient.
Data Splitting Strategies
One of the first decisions to make when starting a modeling project is how to utilize the existing data. One
common technique is to split the data into two groups typically referred to as
the training and testing sets. The training set is used to develop models and feature sets; they are the
substrate for estimating parameters, comparing models, and all of the other activities required to reach a
final model. The test set is used only at the conclusion of these activities for estimating a final, unbiased
assessment of the model’s performance. It is critical that the test set not be used prior to this point.
Looking at the test set results would bias the outcomes, since the testing data will have become part of the model development process.
How much data should be set aside for testing? It is extremely difficult to make a uniform guideline. The
proportion of data can be driven by many factors, including the size of the original pool of samples and the
total number of predictors. With a large pool of samples, the criticality of this decision is reduced once
“enough” samples are included in the training set. Also, in this case, alternatives to a simple initial split of
the data might be a good idea. The ratio of the number of samples (n) to the number of predictors (p) is important to consider, too. We will have much more flexibility in splitting the data when n is much greater than p. However, when n is less than p, we can run into modeling difficulties even if n is seemingly large.
There are a number of ways to split the data into training and testing sets. The most common approach is
to use some version of random sampling. Completely random sampling is a straightforward strategy to
implement and usually protects the process from being biased towards any characteristic of the data.
However, this approach can be problematic when the response is not evenly distributed across the
outcome. A less risky splitting strategy would be to use a stratified random sample based on the outcome.
For classification models, this is accomplished by selecting samples at random within each class. This
approach ensures that the frequency distribution of the outcome is approximately equal within the training
and test sets. When the outcome is numeric, artificial strata can be constructed based on the quartiles of
the data. For example, in the Ames housing price data, the quartiles of the outcome distribution would
break the data into four artificial groups containing roughly 230 houses. The training/test split would then
be conducted within these four groups and the four different training set portions are pooled together (and
the same for the test set).
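A hedged illustration of a stratified split with scikit-learn (assumed available); the stratify argument keeps class frequencies approximately equal in both partitions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(len(X_train), len(X_test))   # the test set is held back until the very end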
Figure 1 shows the different models of the OnePlus phone. If a person is looking for a OnePlus 7 mobile, then the OnePlus 7T and the OnePlus 7 Pro are recommended to the user.
Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. The algorithm uses a breadth-first search and a hash tree to calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets in a large dataset.
The algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
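The counting step at the heart of Apriori can be sketched in plain Python (toy transactions invented for illustration; rule generation is omitted):

from itertools import combinations
from collections import Counter

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2   # an itemset is frequent if it appears in at least 2 transactions

# Pass 1: count single items and keep only the frequent ones.
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# Pass 2 (the Apriori property): only pairs of frequent items can be frequent.
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t & frequent_items), 2))
print({p: c for p, c in pair_counts.items() if c >= min_support})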
What is Python?
Python is a programming language that is preferred for programming due to its vast features, applicability, and simplicity. The Python programming language best fits machine learning due to its platform independence and its popularity in the programming community.
Machine learning is a section of Artificial Intelligence (AI) that aims at making a machine learn from experience and automatically do the work without necessarily being programmed for a task. On the other hand, Artificial Intelligence (AI) is the broader field that encompasses machine learning, where computers are made to perceive at a human level: recognizing images, understanding speech, translating languages, and consequently making critical decisions.
In case you do not have Java installed on your system, install Java before proceeding to the next step.
2. Verifying Scala installation
You need the Scala language to implement Spark, so let us verify the Scala installation using the following command.
$ scala -version
If Scala is already installed on your system, you will see the following response:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don't have Scala installed on your system, proceed to the next step for the Scala installation.
3. Downloading Scala
We are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.
4. Installing Scala
Follow the steps given below to install Scala.
Extract the Scala tar file: type the following command.
$ tar xvf scala-2.11.6.tgz
Move the Scala software files: use the following commands to move the Scala software files to their respective directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command for setting PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
After installation, it is better to verify it. Use the following command to verify the Scala installation.
$ scala -version
If Scala was installed correctly, you will see the following response:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
5. Downloading Apache Spark
We are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the
download folder.
6. Installing Spark
Follow the steps given below for installing Spark.
Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the following line to the ~/.bashrc file. This adds the location where the Spark software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
7. Verifying the Spark Installation
Write the following command to open the Spark shell.
$ spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Spark Architecture
The Apache Spark base architecture diagram is provided in the following figure:
When the Driver Program in the Apache Spark architecture executes, it runs the main program of the application and creates a SparkContext. The SparkContext contains all of the basic functions. The Spark Driver also includes several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster.
The Cluster Manager manages the execution of the various jobs in the cluster. The Spark Driver works in conjunction with the Cluster Manager to control the execution of various other jobs. The Cluster Manager does the task of allocating resources for the job. Once the job has been broken down into smaller tasks, which are then distributed to worker nodes, the Spark Driver controls the execution.
Many worker nodes can be used to process an RDD created in the SparkContext, and the results can also
be cached.
The SparkContext receives task information from the Cluster Manager and enqueues it on worker nodes. The executors are in charge of carrying out these tasks; their lifespan is the same as that of the Spark application. We can increase the number of workers if we want to improve the performance of the system. In this way, we can divide jobs into more coherent parts.
What is a Resilient Distributed Dataset?
A Resilient Distributed Dataset (RDD) is a low-level API and Spark's underlying data abstraction. An RDD is an immutable set of items distributed across cluster nodes to allow parallel processing. The data structure can store any Python, Java, Scala, or user-created object.
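As a minimal sketch, assuming PySpark is installed and keeping with this tutorial's Spark 1.x-era API, an RDD can be created and processed in parallel like this:

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")      # local mode; the app name is arbitrary
rdd = sc.parallelize([1, 2, 3, 4, 5], 2)    # an immutable dataset in 2 partitions
squares = rdd.map(lambda x: x * x)          # transformation: evaluated lazily
print(squares.collect())                    # action: gathers [1, 4, 9, 16, 25]
sc.stop()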