
Lecture 1.

Introduction to Machine Learning

What is Machine Learning?

Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way humans learn, gradually improving accuracy. It allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so: machine learning algorithms use historical data as input to predict new output values.

Over the last couple of decades, the technological advances in storage and processing power have enabled
some innovative products based on machine learning, such as Netflix’s recommendation engine and self-
driving cars.

Machine learning models fall into three primary categories.

1. Supervised machine learning            

Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to
train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, the
model adjusts its weights until it has been fitted appropriately. This occurs as part of the cross-validation
process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations
solve a variety of real-world problems at scale, such as classifying spam in a separate folder from your
inbox. Some methods used in supervised learning include neural networks, naïve Bayes, linear regression,
logistic regression, random forest, and support vector machine (SVM).

2. Unsupervised machine learning

Unsupervised learning, also known as unsupervised machine learning, uses machine-learning algorithms to
analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings
without the need for human intervention. This method’s ability to discover similarities and differences in
information makes it ideal for exploratory data analysis, cross-selling strategies, customer segmentation,
and image and pattern recognition. It is also used to reduce the number of features in a model through the
process of dimensionality reduction. Principal component analysis (PCA) and singular value decomposition
(SVD) are two common approaches for this. Other algorithms used in unsupervised learning include neural
networks, k-means clustering, and probabilistic clustering methods.

3. Semi-supervised learning 

Semi-supervised learning offers a happy medium between supervised and unsupervised learning. During
training, it uses a smaller labeled data set to guide classification and feature extraction from a larger,
unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data for a
supervised learning algorithm. It also helps if it is too costly to label enough data. 
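
To make this concrete, here is a minimal sketch of semi-supervised learning using scikit-learn's SelfTrainingClassifier; the library, dataset, and the 80% unlabeled split are assumptions chosen purely for illustration. A small labeled portion of the data guides learning on the larger unlabeled portion.

# Minimal semi-supervised sketch (assumes scikit-learn is installed).
# Unlabeled samples are marked with the label -1, as SelfTrainingClassifier expects.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)

# Pretend most labels are unknown: keep only about 20% of them.
y_partial = y.copy()
unlabeled_mask = rng.rand(len(y)) < 0.8
y_partial[unlabeled_mask] = -1

# The small labeled set guides classification on the larger unlabeled set.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print("Accuracy against all true labels:", model.score(X, y))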

Machine Learning Skills 

Some machine learning skills include:

 Ability to identify patterns in data
 Ability to build models to make predictions
 Ability to tune model parameters to optimize performance
 Ability to evaluate models for accuracy
 Ability to work with large data sets

Difference between Data Science, Artificial Intelligence, and Machine Learning

Although the terms Data Science, Machine Learning, and Artificial Intelligence are related
and interconnected, each is unique and is used for different purposes. Data Science is a broad term, and
Machine Learning falls within it. Here’s the critical difference between the terms.

Artificial Intelligence
 Includes Machine Learning.
 Combines large amounts of data through iterative processing and intelligent algorithms to help computers learn automatically.
 Popular tools: TensorFlow, Scikit-learn, Keras.
 Uses logic and decision trees.
 Chatbots and voice assistants are popular applications.

Machine Learning
 A subset of Artificial Intelligence.
 Uses efficient programs that can learn from data without being explicitly told to do so.
 Popular tools: Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio.
 Uses statistical models.
 Recommendation systems such as Spotify and facial recognition are popular examples.

Data Science
 Includes various data operations.
 Works by sourcing, cleaning, and processing data to extract meaning from it for analytical purposes.
 Popular tools: SAS, Tableau, Apache Spark, MATLAB.
 Deals with structured and unstructured data.
 Fraud detection and healthcare analysis are popular examples.

Definition and Features of Machine Learning

Following best practices when building machine-learning models is a time-consuming yet important
process. There is a lot to do, ranging from preparing the data, selecting and training algorithms, and
understanding how the algorithm makes decisions, all the way down to deploying models to production. The
machine learning design and maintenance process can be thought of as comprising ten steps (listed below).
However, to save time, increase accuracy, and reduce risk, you do not have to go through the entire machine
learning process manually to build machine learning models. Instead, you can turn to automated machine
learning: software that knows how to automate the repetitive and mundane steps, freeing you up to do what
humans are best at: communication, applying common sense, and being creative. To get the most out of
automated machine learning, it should automate each and every one of these ten steps:
 Preprocessing of Data
 Feature Engineering
 Diverse Algorithms
 Algorithm Selection
 Training and Tuning
 Ensemble
 Head-to-Head Model Competitions
 Human-Friendly Insights
 Easy Deployment
 Model Monitoring and Management

Lecture 2. Supervised Learning Regression and Classification

Supervised Learning

Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to
train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, the
model adjusts its weights until it has been fitted appropriately. This occurs as part of the cross-validation
process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations
solve a variety of real-world problems at scale, such as classifying spam in a separate folder from your
inbox. Some methods used in supervised learning include neural networks, naïve Bayes, linear regression,
logistic regression, random forest, and support vector machine (SVM).

How supervised learning works

Supervised learning uses a training set to teach models to yield the desired output. This training dataset
includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its
accuracy through the loss function, adjusting until the error has been sufficiently minimized. Supervised
learning problems in data mining can be separated into two types, classification and regression:

1. Classification uses an algorithm to accurately assign test data into specific categories. It recognizes
specific entities within the dataset and attempts to draw some conclusions on how those entities
should be labeled or defined. Common classification algorithms are linear classifiers, support vector
machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in
more detail below.
2. Regression is used to understand the relationship between dependent and independent variables. It
is commonly used to make projections, such as for sales revenue for a given business. Linear
regression, logistic regression, and polynomial regression are popular regression algorithms.

Types of Supervised Learning


Various types of algorithms and computation methods are used in the supervised learning process. Below
are some of the common types of supervised learning algorithms:
1. Regression
Regression is used to understand the relationship between dependent and independent variables.
It is a type of supervised learning that learns from labelled data sets to predict a continuous
output for new data. It is widely used in scenarios where the output is a continuous value, for
instance, height or weight.
There are two types of regression; they are as follows (a short Python sketch of both follows their descriptions):

 Linear regression
It is used to identify the relationship between two variables, typically used for making future predictions.
Moreover, linear regression is sub-divided based on the number of independent variables.
For instance, if there is one independent and one dependent variable, it is known as simple linear
regression. Meanwhile, if there are two or more independent variables, it is called multiple
linear regression.
 Logistic regression
Logistic regression is used when the dependent variable is categorical or has binary outputs like ‘yes’ or
‘no’. Moreover, logistic regression is used to solve binary classification problems; that is why it predicts
discrete values for variables.
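
As a brief illustration of both regression types above, the sketch below fits a simple linear regression and a logistic regression with scikit-learn; the library choice and the tiny hours-studied dataset are assumptions made purely for demonstration.

# Minimal sketch of the two regression types (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Simple linear regression: one independent variable predicting a continuous output.
hours_studied = np.array([[1], [2], [3], [4], [5]])
exam_score = np.array([52, 58, 66, 71, 78])
lin = LinearRegression().fit(hours_studied, exam_score)
print("Predicted score for 6 hours:", lin.predict([[6]])[0])

# Logistic regression: binary output (pass = 1, fail = 0).
passed = np.array([0, 0, 1, 1, 1])
log = LogisticRegression().fit(hours_studied, passed)
print("Probability of passing with 2.5 hours:", log.predict_proba([[2.5]])[0, 1])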
2. Naive Bayes
A Naive Bayes algorithm is used for large datasets. The approach works on the fundamental assumption that
every feature in the data acts independently. This means that the presence of one feature will not
impact the other. Generally, it is used in text classification, recommendation systems, and other applications.
Common Naive Bayes variants include Gaussian, multinomial, and Bernoulli models.
A related but distinct algorithm is the decision tree, which remains popular among business organizations.
A decision tree is a supervised learning algorithm structurally resembling a flowchart, but the two
fundamentally perform different roles (a short code sketch of both appears at the end of this subsection).
A decision tree consists of control statements containing decisions and their consequences. The output in a
decision tree relates to the labelling of unforeseen data. ID3 and CART are some of the popular decision
tree algorithms widely used across various industries.
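
The following hedged sketch trains a Gaussian Naive Bayes model and a CART-style decision tree side by side; scikit-learn is an assumed dependency, and note that its DecisionTreeClassifier implements an optimized CART rather than ID3.

# Naive Bayes and a decision tree on the same labeled dataset (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)            # assumes feature independence
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

print("Naive Bayes accuracy:", nb.score(X_test, y_test))
print("Decision tree accuracy:", tree.score(X_test, y_test))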

3. Classification
It is a type of supervised learning algorithm that accurately assigns data into different categories or classes.
It recognizes specific entities and analyses them to conclude where those entities must be categorized.
Some of the classification algorithms are as follows:
 K-nearest neighbor
 Random forest
 Support vector machines
 Decision tree
 Linear classifiers
4. Neural networks
This type of supervised learning algorithm is used to group or categorize raw data. In addition, it is used for
finding patterns or interpreting sensory data. However, the algorithm requires large amounts of
computational resources. As a result, its uses are constrained.

5. Random forest
A random forest algorithm is often called an ensemble method because it combines different supervised
learning models to reach a conclusion. It builds many decision trees and combines the outputs of the
individual trees into a single result. As a result, it is widely used across industries.

Types Classification Algorithms in Machine Learning


The study of classification in statistics is vast, and there are several types of classification algorithms you
can use depending on the dataset you are working with. Classification is one of the most important aspects
of supervised learning. Below are eight of the most common algorithms in machine learning, with a brief example of two of them after the list.
 Logistic Regression Algorithm
 Naïve Bayes Algorithm
 Decision Tree Algorithm
 K-Nearest Neighbors Algorithm
 Support Vector Machine Algorithm
 Random Forest Algorithm
 Stochastic Gradient Descent Algorithm
 Kernel Approximation Algorithm
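
As a small illustration, the sketch below compares two of the listed algorithms, k-nearest neighbors and a support vector machine, on a built-in scikit-learn dataset; the library, dataset, and parameter values are assumptions for demonstration only.

# Comparing two classifiers from the list above (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

for name, clf in [("KNN", knn), ("SVM", svm)]:
    clf.fit(X_train, y_train)            # scaling + fitting happen inside the pipeline
    print(name, "test accuracy:", clf.score(X_test, y_test))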

Lecture 3. Decision Trees and Random Forest

What is Decision Tree?


The decision tree algorithm falls under the category of supervised learning. Decision trees can be used to solve
both regression and classification problems. A decision tree uses a tree representation to solve the problem, in
which each leaf node corresponds to a class label and attributes are represented on the internal nodes of
the tree. We can represent any Boolean function on discrete attributes using a decision tree.

Below are some assumptions that we make while using a decision tree:
At the beginning, we consider the complete training set as the root. Feature values are preferred to be
categorical; if the values are continuous, they are discretized prior to building the model.
Based on attribute values, records are distributed recursively. We use statistical methods for ordering
attributes as the root or an internal node.
A decision tree works on the Sum of Products form, which is also known as Disjunctive Normal Form; for
example, a tree could predict whether people use a computer in their daily life. In a decision tree, the major
challenge is the identification of the attribute for the root node at each level. This process is known as
attribute selection.
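
A short sketch of attribute selection in practice, assuming scikit-learn is available: after training, the printed rules start at the attribute chosen for the root node, and feature_importances_ summarizes how informative each attribute was.

# Inspecting the attribute chosen for the root split (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# The printed flowchart starts at the root attribute selected by the algorithm.
print(export_text(tree, feature_names=list(data.feature_names)))
print("Feature importances:", dict(zip(data.feature_names, tree.feature_importances_)))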

What is Bagging (Bootstrap Aggregation)?


Ensemble machine learning can be mainly categorized into bagging and boosting. The bagging technique is
useful for both regression and statistical classification. Bagging is used with decision trees, where it
significantly raises the stability of models by improving accuracy and reducing variance, which reduces
the challenge of overfitting.

Bagging in ensemble machine learning takes several weak models and aggregates their predictions to select the
best prediction. The weak models specialize in distinct sections of the feature space, which lets bagging
draw on predictions from every model to reach the best overall result.

What is Bootstrapping?
Bagging is composed of two parts: aggregation and bootstrapping. Bootstrapping is a sampling method,
where a sample is chosen out of a set, using the replacement method. The learning algorithm is then run
on the samples selected. The bootstrapping technique uses sampling with replacements to make the
selection procedure completely random. When a sample is selected without replacement, the subsequent
selections of variables are always dependent on the previous selections, making the criteria non-random.
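
A minimal illustration of bootstrapping with NumPy (an assumed tool; the ten "rows" are placeholders): sampling with replacement means some rows appear several times while others are left out of the sample.

# Bootstrapping: sampling with replacement (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                      # pretend these are 10 training rows

bootstrap_sample = rng.choice(data, size=len(data), replace=True)
out_of_bag = np.setdiff1d(data, bootstrap_sample)   # rows never selected

print("Bootstrap sample:", bootstrap_sample)
print("Out-of-bag rows :", out_of_bag)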

What is Aggregation?
Model predictions undergo aggregation to combine them for the final prediction to consider all the
possible outcomes. The aggregation can be done based on the total number of outcomes or the probability
of predictions derived from the bootstrapping of every model in the procedure.

What is an Ensemble Method?


Both bagging and boosting are the most prominent ensemble techniques. An ensemble method is a
machine-learning technique that trains multiple models using the same learning algorithm. Ensemble
methods belong to a bigger group of multi-classifiers.
Multi-classifiers are a group of multiple learners, sometimes running into thousands, with a common goal: their
outputs can be fused to solve a common problem. Another category of multi-classifiers is hybrid methods. Hybrid
methods also use a set of learners, but unlike plain multi-classifiers, they can use distinct learning methods.

What is a random forest?


A random forest is a supervised machine-learning algorithm in which the calculations of numerous decision
trees are combined to produce one result. It is popular because it is simple yet effective.

Random forest is an ensemble method, a technique where we take many base-level models and combine
them to get improved results. Therefore, to understand how it operates, we first need to look at its
components, decision trees, and how they work.

Decision trees: the trees of random forests


Decision trees are a category of machine learning algorithms used for tasks like classification and
regression. These algorithms take data and create models that are similar to decision trees that you might
have encountered in other fields. A decision tree model takes some input data and follows a series of
branching steps until it reaches one of the predefined output values.

Some of the most common types of this algorithm are classification and regression trees.

Classification trees
The technique is used to determine which “class” a target variable is most likely to belong to. Thus, you can
determine who will or will not subscribe to a streaming platform or who will drop out of college, and who
will finish their studies successfully.

How does the random forest algorithm work?


Now that we know what a single decision tree is and how it can be trained, we are ready to train a whole
forest of them.
Let us see how the process happens, step by step.
1. Split the dataset into subsets
A random forest is an ensemble of decision trees. To create many decision trees, we must divide the
dataset we have into subsets.
There are two main ways to do this: you can randomly choose which features each tree is trained on
(random feature subspaces), and you can take a sample of the training rows with replacement for each tree
(bootstrap sample).
2. Train decision trees
After we have split the dataset into subsets, we train decision trees on these subsets. The process of
training is the same as it would be for training an individual tree – we just make a lot more of them.
It might be interesting to know that this training process is very scalable: since the trees are independent,
you can parallelize the training process easily.
3. Aggregate the results
Each individual tree produces one result that depends on the tree’s initial data. To get rid of the
dependence on the initial data and produce a more accurate estimation, we combine their outputs into one
result. Different methods of aggregating the results can be used. For example, in the case of classification,
majority voting is used quite often, whereas for regression, averaging is applied.
4. Validate the model
After we complete the training procedure with the training data and run the tests with the test dataset, we
perform the holdout validation procedure. This involves training a new model with the same
hyperparameters. In particular, these include the number of trees, the pruning and training procedures, the split
function, etc.
Note that the objective of training is not to find the specific model instance that is most suitable for us. The
goal is to develop a general model without pretrained parameters and find the most appropriate training
procedure in terms of metrics: accuracy, overfitting resistance, memory, and other generic parameters.
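
The sketch below runs the whole procedure above with scikit-learn's RandomForestClassifier (an assumed tool); bootstrap sampling, random feature subspaces, parallel training, and aggregation by majority vote all happen inside fit and predict.

# End-to-end random forest sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # random feature subspace considered at each split
    bootstrap=True,        # sample rows with replacement for each tree
    n_jobs=-1,             # trees are independent, so training parallelizes
    random_state=0,
)
forest.fit(X_train, y_train)
print("Held-out accuracy:", forest.score(X_test, y_test))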

Lecture 4. Unsupervised Learning

What is Unsupervised Learning?


Unsupervised learning is based on problems where you do not get any information about the output
values. Since the data is not labeled, this approach is useful when there is a need to learn how a set of
items can be grouped based on the similarities between them. Unsupervised learning provides an implicit
descriptive analysis of the pieces of information uncovered by any clustering algorithm that can be used to
obtain complete information from an unlabeled dataset.

In unsupervised learning, we aim to extend the characteristics of certain data points to their neighbors by
assuming that the similarities between them are not limited to some specific features only. For example, in
a recommendation system, a group of users can be grouped based on their interests in certain movies. If
the chosen criteria detect analogies between two users, we can share the non-overlapping elements
between them.

Applications of Unsupervised Learning


Unsupervised learning enables systems to use AI algorithms to identify patterns within datasets that are
otherwise unlabeled or unclassified. There are numerous applications of unsupervised learning,
with common examples including recommendation systems, product segmentation, data set
labeling, customer segmentation, and similarity detection.
In short, unsupervised learning means finding similarities within an unlabeled dataset. Some of the common
applications where unsupervised learning
is used are:
1. Products Segmentation
2. Customer Segmentation
3. Similarity Detection
4. Recommendation Systems
5. Labeling unlabeled datasets

Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure
is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ
in how they work: unlike the K-means algorithm, there is no requirement to predetermine the number of
clusters.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all
data points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down
approach.
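
Here is a hedged sketch of the agglomerative (bottom-up) approach using scikit-learn and SciPy, both assumed to be installed; the blob data is synthetic, and the dendrogram drawn at the end shows the merge hierarchy.

# Agglomerative clustering plus a dendrogram (assumes scikit-learn, SciPy, Matplotlib).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Bottom-up merging until the requested number of clusters is left.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("Cluster labels:", labels)

# The linkage matrix encodes the full merge hierarchy (the dendrogram tree).
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()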

Why hierarchical clustering?


Since we already have other clustering algorithms such as K-means clustering, why do we need hierarchical
clustering? As we have seen, K-means clustering has some challenges: it requires a predetermined number of
clusters, and it always tries to create clusters of the same size. To solve these two challenges, we can opt for
the hierarchical clustering algorithm, because in this algorithm we do not need to know the number of
clusters in advance.

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in
machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is and how
the algorithm works, along with a brief Python sketch of k-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that
each data point belongs to only one group with similar properties. It allows us to cluster the data into
different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its
own, without the need for any training. It is a centroid-based algorithm, where each cluster is associated
with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points
and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in this
algorithm.
The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center create a cluster.
Hence each cluster has data points with some commonalities and is away from other clusters.
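
A minimal K-means sketch with scikit-learn (an assumed dependency; the blob data is synthetic): K is chosen up front, and the algorithm alternates between assigning points to the nearest centroid and updating the centroids.

# K-means on synthetic data (assumes scikit-learn is installed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)            # iterative assignment + centroid updates

print("Centroids:\n", kmeans.cluster_centers_)
print("Sum of squared distances to centroids (inertia):", kmeans.inertia_)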
Lecture 5. Time Series Analysis
What is Time Series?
Time series analysis is a specific way of analyzing a sequence of data points collected over an interval of
time. In time series analysis, analysts record data points at consistent intervals over a set period of time
rather than just recording the data points intermittently or randomly. However, this type of analysis is not
merely the act of collecting data over time.
What sets time series data apart from other data is that the analysis can show how variables change over
time. In other words, time is a crucial variable because it shows how the data adjusts over the course of
the data points as well as the final results. It provides an additional source of information and a set order of
dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and reliability.
An extensive data set ensures you have a representative sample size and that analysis can cut through
noisy data. It also ensures that any trends or patterns discovered are not outliers and can account for
seasonal variance. Additionally, time series data can be used for forecasting—predicting future data based
on historical data.

Time Series Analysis Types


Because time series analysis includes many categories or variations of data, analysts sometimes must make
complex models. However, analysts can’t account for all variances, and they can’t generalize a specific
model to every sample. Models that are too complex or that try to do too many things can lead to a lack of
fit. Models that lack fit or that overfit fail to distinguish between random error and true
relationships, leaving analysis skewed and forecasts incorrect.
Models of time series analysis include:
1. Classification: Identifies and assigns categories to the data.
2. Curve fitting: Plots the data along a curve to study the relationships of variables within the data.
3. Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal variation.
4. Explanative analysis: Attempts to understand the data and the relationships within it, as well as
cause and effect.
5. Exploratory analysis: Highlights the main characteristics of the time series data, usually in a visual
format.
6. Forecasting: Predicts future data. This type is based on historical trends. It uses the historical data
as a model for future data, predicting scenarios that could happen along future plot points.
7. Intervention analysis: Studies how an event can change the data.
8. Segmentation: Splits the data into segments to show the underlying properties of the source
information.

Time Series Analysis Models and Techniques


Just as there are many types and models, there are also a variety of methods to study data. Here are the
three most common.
1. Box-Jenkins ARIMA models: These univariate models are used to better understand a single time-
dependent variable, such as temperature over time, and to predict future data points of variables.
These models work on the assumption that the data is stationary. Analysts have to account for and
remove as many differences and seasonality in past data points as they can. Thankfully, the ARIMA
model includes terms to account for moving averages, seasonal difference operators, and
autoregressive terms within the model.
2. Box-Jenkins Multivariate Models: Multivariate models are used to analyze more than one time-
dependent variable, such as temperature and humidity, over time.
3. Holt-Winters Method: The Holt-Winters method is an exponential smoothing technique. It is
designed to predict outcomes, provided that the data points include seasonality.
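
As a brief, hedged illustration of two of these methods, the sketch below fits a Box-Jenkins ARIMA model and Holt-Winters exponential smoothing with the statsmodels library (an assumed dependency) on a small synthetic monthly series with trend and seasonality.

# ARIMA and Holt-Winters forecasts on synthetic data (assumes statsmodels, pandas, NumPy).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(10, 30, 48)
season = 5 * np.sin(2 * np.pi * np.arange(48) / 12)   # yearly seasonality
series = pd.Series(trend + season + rng.normal(0, 1, 48), index=idx)

arima = ARIMA(series, order=(1, 1, 1)).fit()           # differencing handles the trend
print("ARIMA 3-step forecast:\n", arima.forecast(steps=3))

hw = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
print("Holt-Winters 3-step forecast:\n", hw.forecast(3))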

Lecture 6. Ensemble Learning

What Is Ensemble learning?


Ensemble learning is a general meta-approach to machine learning that seeks better predictive
performance by combining the predictions from multiple models.
The ensemble methods in machine learning combine the insights obtained from multiple learning models
to facilitate accurate and improved decisions.
In learning models, noise, variance, and bias are the major sources of error. The ensemble methods in
machine learning help minimize these error-causing factors, thereby ensuring the accuracy and stability of
machine learning (ML) algorithms.

Ensemble Learning Methods


Ensemble methods are techniques that aim at improving the accuracy of results in models by combining
multiple models instead of using a single model. The combined models increase the accuracy of the results
significantly. This has boosted the popularity of ensemble methods in machine learning.
Main Types of Ensemble Methods:

1. Bagging
Bagging, the short form for bootstrap aggregating, is mainly applied in classification and regression. It
increases the accuracy of models through decision trees, which reduces variance to a large extent. The
reduction of variance increases accuracy and reduces overfitting, which is a challenge to many predictive
models. Bagging consists of two steps, bootstrapping and aggregation. Bootstrapping is a
sampling technique where samples are derived from the whole population (set) using the replacement
procedure. The sampling with replacement method helps make the selection procedure randomized. The
base learning algorithm is run on the samples to complete the procedure.
Aggregation in bagging is done to incorporate all possible outcomes of the prediction and randomize the
outcome. Without aggregation, predictions will not be accurate because all outcomes are not taken into
consideration. Therefore, the aggregation is based on the probability bootstrapping procedures or on the
basis of all outcomes of the predictive models.
Bagging is advantageous since weak base learners are combined to form a single strong learner that is
more stable than the single learners. It also reduces variance, thereby reducing the overfitting of
models. One limitation of bagging is that it is computationally expensive, and it can lead to more bias in
models when the proper bagging procedure is ignored.
2. Boosting
Boosting is an ensemble technique that learns from previous predictor mistakes to make better predictions
in the future. The technique combines several weak base learners to form one strong learner, thus
significantly improving the predictability of models. Boosting works by arranging weak learners in a
sequence, such that each weak learner learns from the mistakes of the learner before it in the sequence,
creating better predictive models.
Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost
(Extreme Gradient Boosting). AdaBoost uses weak learners in the form of decision trees, which mostly
include one split, popularly known as decision stumps. AdaBoost’s initial decision stump comprises
observations carrying equal weights.
Gradient boosting adds predictors sequentially to the ensemble, where each new predictor corrects its
predecessor, thereby increasing the model’s accuracy. New predictors are fit to counter the effects of errors
in the previous predictors. Gradient descent helps the gradient booster identify problems in the learners’
predictions and counter them accordingly.
XGBoost makes use of gradient-boosted decision trees, providing improved speed and performance. It
relies heavily on the computational speed and the performance of the target model. Model training must
follow a sequence, which makes the implementation of gradient boosted machines comparatively slow.
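
A short sketch of two boosting variants described above, assuming scikit-learn is available: AdaBoost, whose default weak learner is a one-split decision stump, and gradient boosting, where each new tree corrects the errors of the current ensemble.

# AdaBoost and gradient boosting compared (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: the default base learner is a one-split decision tree (a decision stump).
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

# Gradient boosting: each new tree is fit to the errors of the ensemble so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

for name, clf in [("AdaBoost", ada), ("Gradient boosting", gbm)]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))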
3. Stacking
Stacking, another ensemble method, is often referred to as stacked generalization. This technique works by
training a higher-level algorithm to combine the predictions of several other learning algorithms. Stacking has
been successfully implemented in regression, density estimation, distance learning, and classification. It
can also be used to measure the error rate involved during bagging.
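
Here is a minimal stacked generalization sketch using scikit-learn's StackingClassifier (an assumed tool): the base learners produce predictions that a final logistic regression meta-learner combines.

# Stacking: base learners feed a meta-learner (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("Stacked model accuracy:", stack.score(X_test, y_test))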

Gradient Boosting
Gradient boosting is a technique used in creating models for prediction. The technique is mostly used in
regression and classification procedures. Prediction models are often presented as decision trees for
choosing the best prediction. Gradient boosting presents model building in stages, just like other boosting
methods, while allowing the generalization and optimization of differentiable loss functions. The concept of
gradient boosting originated from American statistician Leo Breiman, who discovered that the technique
could be applied to appropriate cost functions as an optimization algorithm. The method has undergone
further development to optimize cost functions by iteratively picking weak hypotheses or a function with a
negative gradient.
Data Splitting Strategies
One of the first decisions to make when starting a modeling project is how to utilize the existing data. One
common technique is to split the data into two groups, typically referred to as
the training and testing sets. The training set is used to develop models and feature sets; they are the
substrate for estimating parameters, comparing models, and all of the other activities required to reach a
final model. The test set is used only at the conclusion of these activities for estimating a final, unbiased
assessment of the model’s performance. It is critical that the test set not be used prior to this point.
Looking at the test set results would bias the outcomes since the testing data will have become part of the
model development process.
How much data should be set aside for testing? It is extremely difficult to make a uniform guideline. The
proportion of data can be driven by many factors, including the size of the original pool of samples and the
total number of predictors. With a large pool of samples, the criticality of this decision is reduced once
“enough” samples are included in the training set. Also, in this case, alternatives to a simple initial split of
the data might be a good idea. The ratio of the number of samples (n) to the number of predictors (p) is
important to consider, too. We will have much more flexibility in splitting the data when n is much greater
than p. However, when n is less than p, we can run into modeling difficulties even if n is seemingly large.
There are a number of ways to split the data into training and testing sets. The most common approach is
to use some version of random sampling. Completely random sampling is a straightforward strategy to
implement and usually protects the process from being biased towards any characteristic of the data.
However this approach can be problematic when the response is not evenly distributed across the
outcome. A less risky splitting strategy would be to use a stratified random sample based on the outcome.
For classification models, this is accomplished by selecting samples at random within each class. This
approach ensures that the frequency distribution of the outcome is approximately equal within the training
and test sets. When the outcome is numeric, artificial strata can be constructed based on the quartiles of
the data. For example, in the Ames housing price data, the quartiles of the outcome distribution would
break the data into four artificial groups containing roughly 230 houses. The training/test split would then
be conducted within these four groups and the four different training set portions are pooled together (and
the same for the test set).
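
A hedged sketch of a stratified train/test split with scikit-learn (an assumed dependency); passing stratify=y keeps the class frequencies roughly equal in the training and test sets, as described above.

# Stratified random split (assumes scikit-learn is installed).
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print("Training class counts:", Counter(y_train))
print("Test class counts    :", Counter(y_test))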

Simple model selection


Consider two simple examples where the model that is used to make a prediction is first selected using
some data-dependent procedure. In both cases, I would not recommend using data splitting in practice
since the model selection procedures are relatively simple and well defined and there is the possibility of
applying various whole-data methods for adjusting the inference for the model selection effect. Data
splitting would be more appropriate where the model selection procedure is either complex or dependent
on graphical methods which are difficult to precisely define. Nevertheless, if data splitting cannot work well
in these simple circumstances, then we can have little confidence in its value in more complicated
situations.
Lecture 7. Recommendation System

What is a Recommendation System?


A recommendation system (or recommender system) is a class of machine learning algorithms that uses data to help
predict, narrow down, and find what people are looking for among an exponentially growing number of
options.
A recommendation system is an artificial intelligence or AI algorithm, usually associated with machine
learning, that uses Big Data to suggest or recommend additional products to consumers. These can be
based on various criteria, including past purchases, search history, demographic information, and other
factors. Recommender systems are highly useful as they help users discover products and services they
might otherwise have not found on their own.
Recommender systems are trained to understand the preferences, previous decisions, and characteristics
of people and products using data gathered about their interactions. These include impressions, clicks,
likes, and purchases. Because of their capability to predict consumer interests and desires on a highly
personalized level, recommender systems are a favorite with content and product providers. They can
drive consumers to just about any product or service that interests them, from books to videos to health
classes to clothing.

Types of Recommendation System


1. Popularity-Based Recommendation System
It is a type of recommendation system that works on the principle of popularity, recommending whatever is
currently trending. These systems check which products or movies are trending or most popular among
the users and directly recommend those.
For example, if a product is often purchased by most people, the system learns that the product is most
popular, so for every new user who has just signed up, the system will recommend that product
to that user as well, and the chances are high that the new user will also purchase it.
2. Classification Model
This model uses features of both products and users to predict whether a user will like a product
or not.
The output can be either 0 or 1: 1 if the user likes the product, and 0 otherwise.
3. Content-Based Recommendation System
It is another type of recommendation system that works on the principle of similar content. If a user is
watching a movie, the system will look for other movies with similar content or the same genre as
the movie the user is watching. There are various fundamental attributes that are used to compute the
similarity between items of content.
To explain more about how exactly the system works, consider the different models of the OnePlus phone:
if a person is looking at the OnePlus 7, then the OnePlus 7T and OnePlus 7 Pro are recommended to the user.
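
To sketch how such similarity can be computed, the example below (assuming scikit-learn; the product descriptions are made up purely for illustration) turns item descriptions into TF-IDF vectors and ranks items by cosine similarity.

# Content-based similarity sketch (assumes scikit-learn; descriptions are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "OnePlus 7": "6.41 inch phone 48MP camera Snapdragon 855",
    "OnePlus 7T": "6.55 inch phone 48MP camera Snapdragon 855 Plus",
    "OnePlus 7 Pro": "6.67 inch phone 48MP camera Snapdragon 855 pop-up selfie",
    "Budget phone": "6.0 inch phone 13MP camera entry-level chipset",
}
names = list(items)
tfidf = TfidfVectorizer().fit_transform(items.values())   # item descriptions -> vectors
similarity = cosine_similarity(tfidf)                     # pairwise similarity matrix

query = names.index("OnePlus 7")
ranked = sorted(zip(names, similarity[query]), key=lambda p: p[1], reverse=True)
print("Most similar to OnePlus 7:", [n for n, _ in ranked if n != "OnePlus 7"][:2])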

Data Mining Market Basket Analysis


Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analyzing large data sets, such as purchase history,
to reveal product groupings and products that are likely to be purchased together.
The adoption of market basket analysis was aided by the advent of electronic point-of-sale (POS) systems.
Compared to handwritten records kept by store owners, the digital records generated by POS systems
made it easier for applications to process and analyze large volumes of purchase data.
Implementation of market basket analysis requires a background in statistics and data science and some
algorithmic computer programming skills. For those without the needed technical skills, commercial, off-
the-shelf tools exist.
One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes transaction data
contained in a spreadsheet and performs market basket analysis. A transaction ID must relate to the items
to be analyzed. The Shopping Basket Analysis tool then creates two worksheets:
1. The Shopping Basket Item Groups worksheet, which lists items that are frequently purchased
together,
2. The Shopping Basket Rules worksheet, which shows how items are related (for example, purchasers of
Product A are likely to buy Product B).

Types of Market Basket Analysis


1. Descriptive market basket analysis: This type only derives insights from past data and is the most
frequently used approach. The analysis here does not make any predictions but rates the
association between products using statistical techniques. For those familiar with the basics of Data
Analysis, this type of modeling is known as unsupervised learning.
2. Predictive market basket analysis: This type uses supervised learning models like classification and
regression. It essentially aims to mimic the market to analyze what causes what to happen.
Essentially, it considers items purchased in a sequence to determine cross-selling. For example,
buying an extended warranty is more likely to follow the purchase of an iPhone. While it isn't as
widely used as descriptive market basket analysis, it is still a very valuable tool for marketers.
3. Differential market basket analysis: This type of analysis is beneficial for competitor analysis. It
compares purchase history between stores, between seasons, between two time periods, between
different days of the week, etc., to find interesting patterns in consumer behavior. For example, it
can help determine why some users prefer to purchase the same product at the same price on
Amazon vs Flipkart. The answer can be that the Amazon reseller has more warehouses and can
deliver faster, or maybe something more profound like user experience.

Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on
databases that contain transactions. With the help of these association rules, it determines how strongly
or how weakly two objects are connected. This algorithm uses a breadth-first search and a hash tree to
calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets
in a large dataset.
This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket
analysis and helps to find those products that can be bought together. It can also be used in the healthcare
field to find drug reactions for patients.

Steps for Apriori Algorithm


Below are the steps for the apriori algorithm:
1. Determine the support of item sets in the transactional database, and select the minimum support
and confidence.
2. Take all the itemsets in the transactions with a support value higher than the minimum or selected
support value.
3. Find all the rules of these subsets that have higher confidence value than the threshold or minimum
confidence.
4. Sort the rules in decreasing order of lift.
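
The following sketch walks through these steps with the third-party mlxtend library (an assumed dependency whose exact interface may vary slightly between versions); the toy transactions are one-hot encoded before mining.

# Apriori with mlxtend (assumed installed); transactions are illustrative.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Steps 1-2: frequent itemsets above the minimum support threshold.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
# Steps 3-4: rules above the confidence threshold, sorted by lift.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules.sort_values("lift", ascending=False)[
    ["antecedents", "consequents", "support", "confidence", "lift"]])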

Lecture 8. Use of python in machine learning

What is Python?
Python is a programming language that is preferred for many tasks due to its rich features, applicability,
and simplicity. The Python programming language fits machine learning well due to its platform
independence and its popularity in the programming community.
Machine learning is a branch of Artificial Intelligence (AI) that aims at making a machine learn from
experience and automatically do the work without being explicitly programmed for a task. Artificial
Intelligence (AI), in turn, is the broader field containing machine learning, in which computers are made to
approach human-level ability in visual recognition, speech, language translation, and, consequently,
making critical decisions.

Advantages of Using Python


1. Independence across platforms
Developers prefer Python because, unlike many other programming languages, it can run on multiple
platforms without needing to be changed. Python runs across different platforms, such as Windows, Linux, and
macOS, requiring little or no modification. The platforms are fully compatible with the Python
programming language, which means the same program’s code can be run anywhere with little extra work.
The ease of executability makes it easy to distribute software, allowing standalone software to be built and
run using Python. The software can be programmed from start to finish using Python as the only language.
It is a plus for developers, since other programming languages often need to be complemented by other
languages before a project is fully completed. Python’s independence across platforms saves time and resources for
developers, who would otherwise incur a lot of resources to complete a single project.
2. Consistency and simplicity
The Python programming language is a haven for most software developers looking for simplicity and
consistency in their work. The Python code is concise and readable, which simplifies the presentation
process. A developer can write code easily and concisely compared to other programming languages. It
allows developers to receive input from other developers in the community to help enhance the software
or application.
The simplicity of the Python language makes it easy for beginners to master it quickly and with less effort
as compared to other programming languages. Also, experienced developers find it easy to create stable
and reliable systems, and they can focus their efforts on enhancing their creativity and solving real-world
problems using machine learning.
3. Frameworks and libraries variety
Libraries and frameworks are vital in the preparation of a suitable programming environment. Python
frameworks and libraries offer a reliable environment that reduces software development time
significantly. A library basically includes a prewritten code that developers can use to speed up coding
when working on complex projects.
Python includes a modular machine learning library known as PyBrain, which provides easy-to-use
algorithms for use in machine learning tasks. The best and most reliable coding solutions require a proper
structure and tested environment, which is available in the Python frameworks and libraries.

Python is Most Suitable for Machine Learning


Machine learning and AI, as a unit, are still developing but are rapidly growing in usage due to the need for
automation. Artificial Intelligence makes it possible to create innovative solutions to common problems,
such as fraud detection, personal assistants, spam filters, search engines, and recommendations systems.
The demand for smart solutions to real-world problems necessitates the need to develop AI further in
order to automate tasks that are tedious to program without AI. The Python programming language is
considered one of the best choices for building such automation, and it offers greater simplicity and consistency
than many other programming languages. Further, the presence of an engaging Python community makes it easy
for developers to discuss projects and contribute ideas on how to enhance their code.

Understanding of Python libraries for machine learning


When it comes to machine learning and deep learning projects written in Python, there are thousands of
libraries to pick and choose from. However, they’re not all on the same level of code quality, diversity, or
size. To help you choose, here are the best Python libraries for machine learning and deep learning.
1. NumPy
2. SciPy
3. Scikit-Learn
4. Theano
5. TensorFlow
6. Keras
7. PyTorch
8. Pandas
9. Matplotlib
10. Beautiful Soup
11. Scrapy
12. Seaborn
13. PyCaret
14. OpenCV
15. Caffe
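
As a small sketch of how several of these libraries typically combine in one workflow (assuming NumPy, Pandas, scikit-learn, and Matplotlib are installed; the data is synthetic):

# NumPy + Pandas + scikit-learn + Matplotlib working together on synthetic data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({"ad_spend": rng.uniform(0, 100, 50)})
df["sales"] = 3.2 * df["ad_spend"] + rng.normal(0, 15, 50)   # made-up relationship

model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print("Estimated slope:", model.coef_[0])

plt.scatter(df["ad_spend"], df["sales"], label="observations")
plt.plot(df["ad_spend"], model.predict(df[["ad_spend"]]), color="red", label="fit")
plt.xlabel("ad_spend")
plt.ylabel("sales")
plt.legend()
plt.show()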

Lecture 9. Introduction to Apache Spark

Apache Spark – Installation


Spark is Hadoop’s sub-project. Therefore, it is better to install Spark into a Linux based system. The
following steps show how to install Apache Spark.
1. Verifying Java Installation
Java installation is one of the mandatory prerequisites for installing Spark. Try the following command to verify the
Java version.
$java -version
If Java is already installed on your system, you get to see the following response −

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, then Install Java before proceeding to next step.
2. Verifying Scala installation
You need the Scala language to implement Spark. So let us verify the Scala installation using the following command.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to the next step for Scala installation.
3. Downloading Scala
We are using the scala-2.11.6 version. After downloading it, you will find the Scala tar file in the download folder.
4. Installing Scala
Follow the below given steps for installing Scala.
Extract the Scala tar file. Type the following command:
$ tar xvf scala-2.11.6.tgz
Move the Scala software files. Use the following commands to move the Scala software files to their respective
directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command for setting PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
After installation, it is better to verify it. Use the following command to verify the Scala installation.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
5. Downloading Apache Spark
We are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the
download folder.
6. Installing Spark
Follow the steps given below for installing Spark.
Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the following line to the ~/.bashrc file. This adds the location where the Spark software files are
located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
7. Verifying the Spark Installation
Use the following command to open the Spark shell.
$spark-shell
If Spark is installed successfully, then you will find the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

Spark Architecture
The Apache Spark base architecture consists of a driver program, a cluster manager, and worker nodes that run executors.

When the Driver Program in the Apache Spark architecture executes, it calls the real program of an
application and creates a SparkContext. SparkContext contains all of the basic functions. The Spark Driver
includes several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and
Block Manager, all of which are responsible for translating user-written code into jobs that are actually
executed on the cluster.
The Cluster Manager manages the execution of the various jobs in the cluster. The Spark Driver works in
conjunction with the Cluster Manager to control the execution of various other jobs. The Cluster Manager
does the task of allocating resources for the job. Once the job has been broken down into smaller jobs,
which are then distributed to worker nodes, the Spark Driver controls the execution.
Many worker nodes can be used to process an RDD created in the SparkContext, and the results can also
be cached.
The Spark Context receives task information from the Cluster Manager and enqueues it on worker nodes.
The executor is in charge of carrying out these tasks. The lifespan of executors is the same as that of the
Spark Application. We can increase the number of workers if we want to improve the performance of the
system. In this way, we can divide jobs into more coherent parts.
What is a Resilient Distributed Dataset?
A Resilient Distributed Dataset (RDD) is a low-level API and Spark's underlying data abstraction. An RDD is an
immutable set of items distributed across the nodes of a cluster to allow parallel processing. The data structure
can store any Python, Java, Scala, or user-created object.
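
A minimal PySpark sketch, assuming the pyspark package is installed, showing an RDD created from a Python list and processed in parallel across partitions:

# RDD basics with PySpark (assumes pyspark is installed and importable).
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")       # local mode, using all cores
rdd = sc.parallelize(range(1, 11), 4)           # distribute data into 4 partitions
squares = rdd.map(lambda x: x * x)              # transformation (lazy, not yet run)
total = squares.reduce(lambda a, b: a + b)      # action triggers the computation

print("Sum of squares 1..10:", total)
sc.stop()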
