
Dr. Ahmet Esad TOP
ahmetesadtop@aybu.edu.tr
o Formal definition by Tom M. Mitchell
o "A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves with experience E."
o Learning from experience is natural
o for humans and animals
o Humans collect information from events or observations of facts
o When a new event occurs and its result is unknown,
o the collected knowledge is used

o ML enables machines to learn from previous experiences


o ML techniques learn directly from the data itself
o No pre-determined equations or explicitly programmed decisions
o They essentially build the path to the answer using just the data
o ML algorithms find patterns or regularities in data
o Based on these patterns, they can make judgments, produce estimations, or take actions

o Data size is crucial for ML


o As the number of samples increases, performance improves
Traditional Programming
Data + Program → Computer → Output

Machine Learning
Data + Output → Computer → Program
o ML is used when:
o Human expertise does not exist (navigating on Mars)
o Humans can’t explain their expertise (speech recognition)
o Models must be customized (personalized medicine)
o Models are based on huge amounts of data (genomics)

o Learning isn’t always useful:


o There is no need to “learn” to calculate payroll
o When developing explicit algorithms is challenging (or they fail),
o ML is employed
o Examples: network intrusion detection, e-mail filtering, speech recognition,
bioinformatics, and computer vision
o ML is divided into two categories
o supervised learning
o unsupervised learning
o Unsupervised learning has no labels
o without the corresponding output
o some patterns and relations can be found
o It learns underlying structure and distribution in the data (to model them)
o No correct answer or teacher (supervisor) available in this learning type
o After the similarities or differences have been revealed
o the data can be grouped
o If grouped according to some rules → association solution
o association detects sets of items that frequently occur together in the dataset
o If grouped according to inherent groupings in the data → clustering solution
o Clustering splits the dataset into groups according to similarities
o Common unsupervised learning algorithms are K-Means Clustering, Principal Component
Analysis (PCA), Hidden Markov Model, and Apriori algorithm
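As a rough illustration of clustering (not part of the original slides), the sketch below groups a handful of made-up 2-D points with scikit-learn's KMeans, with no labels given to the algorithm.

import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D samples with no labels; the algorithm must find the structure itself
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each sample, e.g. [0 0 1 1 0 1]
print(kmeans.cluster_centers_)  # the two centroids the algorithm discovered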
o Learns from labeled data
o Data must be composed of pairs
o It learns a generalized rule of mapping of the pairs
o from sample inputs to their desired outputs
o After training with samples
o it produces a function (mapping)
o that maps new inputs to their unknown outputs
o the intention is to accurately discover the labels
o success depends on the generalization capacity of the algorithm
o Y=f(X) → Y is output, X is input, and f() is the learned mapping function
o In supervised learning
o a supervisor assigns labels to data,
o then it is processed by one of the supervised learning algorithms
o to generate the desired function
o Supervised learning uses two techniques:
o classification
o predicts discrete responses
o their outputs are categorical, such as "black" or "white"
o e.g., "yes" or "no"
o regression
o predicts continuous responses
o their outputs are real numbers, such as "temperature"
o both are used to generate predictive models
o Given (x1, y1), (x2, y2), ..., (xn, yn)
o Learn a function f(x) to predict y given x
o y is real-valued == regression

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) plotted against Year, 1970-2020; an example of a regression problem]
o Given (x1, y1), (x2, y2), ..., (xn, yn)
o Learn a function f(x) to predict y given x
o y is categorical == classification

[Figure: Breast Cancer (Malignant = 1 / Benign = 0) plotted against Tumor Size; a threshold on tumor size separates "Predict Benign" from "Predict Malignant"]
Classification: input features (2.1, 1.8) ⇒ good
Regression: input features (2.1, 1.8) ⇒ 0.9
With a likelihood of 90%, this email is good
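A minimal sketch of the two techniques (the feature values and labels below are invented for illustration): a classifier predicts a discrete response for the email, while a regressor predicts a continuous score.

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Invented email feature vectors and targets
X = np.array([[2.1, 1.8], [0.3, 0.2], [1.9, 2.0], [0.1, 0.4]])
y_class = np.array([1, 0, 1, 0])          # 1 = "good", 0 = "spam" (categorical labels)
y_score = np.array([0.9, 0.1, 0.8, 0.2])  # a continuous "goodness" score

clf = LogisticRegression().fit(X, y_class)   # classification: discrete response
reg = LinearRegression().fit(X, y_score)     # regression: real-valued response

print(clf.predict([[2.1, 1.8]]))   # e.g. [1], i.e. "good"
print(reg.predict([[2.1, 1.8]]))   # e.g. [0.9], a likelihood-like score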
o One of the most popular ML approaches is Artificial Neural Networks (ANN)
o also referred to as neural networks (NN)
o It is an information processing system
o It can be employed for both supervised and unsupervised learning
o There is no pre-knowledge or set of programmed rules for the task expected to be performed by the ANN before the training
o An ANN simply takes input data (i.e., example data)
o and learns the ability to perform the required task
o by parsing the data and detecting patterns inside the data
o ANN learns (e.g., categorizing animals from images like ‘wolf’, ’giraffe’, or ‘dog’) by using
sample images
o it gets each image with a corresponding output (i.e., a label) as its training data
o these should be in the form of the pairs described earlier
o Initially, ANN uses the first sample image as its input
o feeds forward, and then receives the output of the first image
o According to the output, it measures the error
o analyzes how close it is to the intended result
o Then, it makes some adjustments to the weights
o by using the gradient descent algorithm
o Its weights are more accurately adjusted after several iterations
o using various samples
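To make the loop above concrete, here is a minimal sketch of a single training iteration in PyTorch (the network size, data, and label are placeholders, not from the slides): a forward pass, an error measurement, backpropagation, and a gradient-descent weight adjustment.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 8), nn.Sigmoid(), nn.Linear(8, 3))  # tiny placeholder ANN
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)  # gradient descent on the weights

x = torch.randn(1, 4)       # one training sample (stand-in for an image's features)
y = torch.tensor([2])       # its label (stand-in for, e.g., the class 'dog')

optimizer.zero_grad()       # clear any previous gradients
output = net(x)             # feed forward
loss = loss_fn(output, y)   # measure how far the output is from the intended result
loss.backward()             # backpropagate the error
optimizer.step()            # adjust the weights by one gradient-descent step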
o An ANN is a network made up of several nodes
o each node communicates with linked nodes
o the receiving node processes what it gets and sends the new information to the next linked nodes
o Nodes are referred to as "artificial neurons" and are comparable to biological neurons
o Each node uses a nonlinear function to produce its output
o the function’s input is the sum of all the inputs of the node
o Edges, which are like "neurotransmitters", are the connections between nodes
o Each edge has its own weight, which is going to be updated after a backpropagation pass
o Layers are groups of neurons that are on the same level
o Each layer waits until the preceding layer has completed all of its computations
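A single artificial node can be sketched in a few lines (the input and weight values are arbitrary): it sums all of its weighted incoming values and squashes the sum with a nonlinear function, here a sigmoid.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node(inputs, weights, bias):
    net = np.dot(weights, inputs) + bias  # sum of all weighted inputs of the node
    return sigmoid(net)                   # nonlinear function producing the node's output

# Arbitrary example values for three incoming edges
print(node(np.array([0.5, -1.0, 2.0]), np.array([0.4, 0.1, 0.7]), bias=0.2))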
o In 1958, Rosenblatt invented the perceptron
o his single-layer perceptron was unable to solve the XOR problem
o this limitation was not overcome until the backpropagation method was created in 1975, allowing multilayer networks to be trained
o One-layered perceptrons (excluding the input layer) can be used to solve "AND" and
"OR" gates
o but a one-layered perceptron cannot be used to create an "XOR" gate
o as a single line is not enough to split "XOR" in a Cartesian plane
o The structure and function of the human brain (i.e., biological neural networks and
neurons) serve as an inspiration for ANN
o Inputs are feature values
o Each feature has a weight
o Sum is the activation
o If the activation is:
o positive, output +1
o negative, output -1
[Diagram: features f1, f2, f3 multiplied by weights w1, w2, w3, summed, and passed through a ">0?" threshold]
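The threshold unit in the diagram can be written out directly; with hand-picked weights (chosen here only for illustration) a single unit realizes "AND" and "OR", but no choice of weights reproduces "XOR", because one line cannot separate its classes.

import numpy as np

def perceptron(x, w, b):
    # activation = weighted sum; output +1 if positive, -1 otherwise
    return 1 if np.dot(w, x) + b > 0 else -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = perceptron(np.array(x), np.array([1.0, 1.0]), b=-1.5)  # fires only for (1, 1)
    or_out = perceptron(np.array(x), np.array([1.0, 1.0]), b=-0.5)   # fires for any 1
    print(x, "AND:", and_out, "OR:", or_out)
# No single (w, b) gives +1 for (0,1), (1,0) and -1 for (0,0), (1,1): that is the XOR limitation.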
o "XOR" must be separated by using at least 2 lines
o at least a 2-layered perceptron network (excluding the input layer) is required
o i.e., at least one hidden layer, as layers between the input and the output are known as hidden layers
o This system is called a "Multilayer Perceptron (MLP)"
o Every node except the input layer uses a nonlinear activation function for its output
o e.g., sigmoid function or hyperbolic tangent
o MLPs employ the backpropagation approach for their training phase
o While training an ANN, the gradient needs to be calculated to update the weights
o after a forward pass
o this is done by backpropagation
o Gradient descent is employed
o it determines the gradient of the loss function
o The error is propagated to previous layers and neurons
o i.e., to neurons directly or indirectly connected to the output neuron
o Each neuron’s net (i.e., incoming) values are calculated
o Each neuron’s out (i.e., outgoing) values are calculated
o The squared errors for the outputs are then determined
o The squared error cost function is minimized using
gradient descent
o weights are revised after each iteration, and this process
continues until the cost is as low as possible
o Gradient descent is guaranteed to converge to a
hypothesis with minimum squared error
o If the given learning rate is sufficiently small
o Gradient descent has the risk that it can over-step
the minimum in the error surface
o If the learning rate is too large
o Gradient descent may not find the global optimum
o If there are multiple local optima in the error surface
o Converging to a local optimum is sometimes slow
o → To overcome these issues, variants of gradient descent were developed
o batch gradient descent tends to converge toward the global optimum
o as it updates after seeing the whole data
o stochastic gradient descent tends to get stuck at local optima
o as it updates after seeing each sample
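The weight-update rule can be illustrated with a one-weight model fit by batch gradient descent (all numbers below are toy values); a stochastic variant would instead update after every individual sample.

import numpy as np

# Toy data roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

w = 0.0
learning_rate = 0.01                     # kept small, as noted above
for step in range(1000):
    error = w * x - y                    # prediction error on the whole dataset (batch)
    grad = 2 * np.mean(error * x)        # gradient of the mean squared error w.r.t. w
    w -= learning_rate * grad            # revise the weight against the gradient
print(w)                                 # approaches roughly 2.0 as the cost is minimized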
o Deep learning (DL) is a subset of ML where the learning procedure occurs in deeper
structures
o Deeper structures → the presence of several hidden layers
o Deep networks may contain tens or even hundreds of hidden layers
o traditional NNs only have one or two hidden layers
o Unlike traditional ML, DL eliminates the requirement for manual feature extraction
o by converting the data into intermediate feature representations
o it extracts features first-hand from the data itself
o Another standout benefit of DL is its capability to continuously enhance its performance
o it keeps getting better performance as the size of the data increases
o improvements in technology → huge amounts of data and powerful GPUs have become available recently
o this situation has made DL very popular
o "Deep Learning" term came out to AI community in 1986 by Rina Dechter
o In 1989, Yann LeCun et al. developed a DNN that could read handwritten ZIP codes from mail using
the backpropagation approach
o Traditional ML approaches were more popular at the time because of the high processing cost of ANNs
o Later, advances in GPU technology became more significant
o making DL considerably more popular than other methods
o The "Big Bang of DL" occurred in 2009, when Nvidia trained DNNs on Nvidia GPUs
o NNs form the basis of the majority of all DL methods
o DNN is an ANN with multiple hidden layers
o As DNNs are feedforward networks, data travels
from the input layer to the output layer
o DNNs have a strong modeling capability since they can capture both linear and non-linear relationships
o When modeling complex data, putting additional
layers in hidden layers may reduce the number
of units required in each hidden layer
o this unlocks combining features learned in previous layers
o In general, DNNs are very challenging to train
o CNN is a notable exception for training deep networks
o A 7-layered CNN called LeNet-5 was introduced in 1998 by LeCun et al. to recognize digits
from 32x32 photos.
o The resolution was limited to 32x32 due to the limited hardware capabilities available at the time
o CNNs attracted great attention after the computing industry acquired advanced
hardware capabilities.
o Several studies have utilized and demonstrated how to train CNNs on GPUs and their appropriate
approaches
o Nowadays, CNN is one of the most prominent and effective DNN types
o It is generally applied to computer vision such as image/video recognition
o The CNN is designed to minimize the need for pre-processing
o making it a very convenient choice for many applications
o CNN does not require manual feature engineering
o CNNs have also achieved state-of-the-art results
o and can be re-trained for new data or tasks
o When people view a picture of a cat
o they can identify it based on its unique features
o such as its claws, four legs, tail, and whiskers
o Similarly, a CNN can classify a picture of a cat by processing the low-level features
o such as curves and edges
o and then creating more abstract concepts using multiple convolutional layers
o Traditional MLP architectures suffer from not scaling well to higher-resolution images
o Due to the "curse of dimensionality"
o a phenomenon in which the number of weights required by the model grows very rapidly with the input size
o The reason behind this is full connectivity between nodes
o each neuron of a fully connected layer on a 32x32 input image requires 1,024 weights
o each neuron of a fully connected layer on a 224x224 input image requires 50,176 weights
o However, a convolutional layer in a CNN can operate with a much smaller number of weights
o E.g., a 7x7 filter that convolves on a 32x32 or 224x224 image will always require only 49 learnable parameters
o regardless of the size of the input image
o This efficiency makes CNNs a more practical choice for image classification tasks
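A quick way to see this difference in weight counts (layer shapes below are chosen only to mirror the numbers above) is to count the parameters of a single fully connected neuron versus a single 7x7 convolutional filter in PyTorch.

import torch.nn as nn

fc = nn.Linear(224 * 224, 1)                     # one dense neuron over a 224x224 image
conv = nn.Conv2d(in_channels=1, out_channels=1,  # one 7x7 filter; image size is irrelevant
                 kernel_size=7)

print(sum(p.numel() for p in fc.parameters()))    # 50177 (50,176 weights + 1 bias)
print(sum(p.numel() for p in conv.parameters()))  # 50 (49 weights + 1 bias)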
o CNNs are composed of layers with three-dimensional neurons
o each of which is connected to a small region of the previous layer known as the receptive field
o this structure allows CNNs to operate with fewer weights compared to traditional MLP architectures
o as the connections between neurons are more localized and not fully connected
o CNNs differ from ANNs in the types of operations performed by their hidden layers
o They include a combination of
o convolutional layers,
o pooling layers,
o a softmax layer,
o fully connected layers,
o and Rectified Linear Units (ReLUs)
o The convolutional layer is always the primary component of every CNN
o The other layers are inserted between convolutional layers
o The fully connected layers are placed at the end of the network
o These hidden layers serve to introduce non-linearity and maintain the dimensions of the input data
o The convolutional layer is characterized by a set of learnable filters
o These filters are used to scan the receptive field to search for matching patterns
o The filters are represented as rectangular arrays of numbers that serve as feature identifiers
o helping the CNN to identify and extract important features from the input data
o they slide across the width and height of the input image
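The scanning behaviour can be sketched directly (the "valid" sliding shown here is a simplification, and the filter values are invented): the filter is slid across the width and height of the input, and a dot product is recorded at each position.

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the filter over every position and record how well the patch matches it
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)                   # toy 8x8 input
edge_filter = np.array([[1.0, -1.0],           # a tiny hand-made edge detector
                        [1.0, -1.0]])
print(conv2d_valid(image, edge_filter).shape)  # (7, 7) feature map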
o ReLU is a type of activation function that is
widely used in deep learning models
o introduces non-linearity to the model
o generally follows each convolutional layer
o One of the main advantages of ReLU is its
simplicity
o its implementation is pretty straightforward and
does not require additional hyperparameters
o ReLU has improved the training speed of DNNs
o compared to other activation functions (e.g.,
sigmoid or hyperbolic tangent).
o ReLU also helps to alleviate the vanishing
gradient problem
o which occurs when the gradients of the weights
in the network become too small (i.e., training
becomes slow).
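The function itself is a one-liner, which is part of its simplicity (the values below are arbitrary): negative net values are clipped to zero and positive ones pass through unchanged, so their gradient stays at 1 instead of shrinking as it does with sigmoid.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)   # max(0, z): clip negatives, pass positives unchanged

print(relu(np.array([-2.0, -0.5, 0.0, 1.3, 4.0])))  # [0.  0.  0.  1.3 4. ]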
o Pooling layer is an essential component of CNN
architectures
o the majority of them are used right after the convolutional
layers
o Their duty is to reduce the spatial dimensions of the results
generated by the convolutional layers
o by merging the outputs of multiple neurons into a single neuron
that utilizes non-linear functions
o This simplification (i.e., downsampling) operation reduces
the number of parameters
o hence reduces computational overhead, as well as helping to
prevent overfitting
o Overfitting occurs when a model becomes too closely tuned
(i.e., close to ideal or fully ideal) to the training data but fails
on the test data (i.e., a generalization issue)
o pooling layers also help to maintain the spatial invariance
of the network
o i.e., can recognize an object regardless of its position in the
image
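A 2x2 max-pooling pass (one common non-linear pooling function; the feature-map values are made up) keeps only the largest value in each block, quartering the number of outputs.

import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    # Crop to even dimensions, split into 2x2 blocks, keep each block's maximum
    fm = feature_map[:h - h % 2, :w - w % 2]
    return fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 1],
               [0, 2, 9, 5],
               [1, 1, 3, 7]], dtype=float)
print(max_pool_2x2(fm))   # [[6. 2.] [2. 9.]]: a 4x4 map downsampled to 2x2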
o Dropout is a handy regularization technique used to
reduce overfitting in neural networks
o It can be applied at different levels of the network
o This method consists of randomly dropping out certain
activations in a layer, with a probability commonly set at
0.5
o half of the hidden neurons are dropped out randomly
o once the training is completed, these neurons are recovered
with their weights
o This technique is beneficial for preventing overfitting
o It can also improve the model’s generalization
capabilities
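One common way to implement this is "inverted" dropout (an implementation detail assumed here, not stated in the slides): each activation is zeroed with probability 0.5 during training, survivors are rescaled, and at test time all neurons are kept.

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                      # after training, all neurons are kept
    mask = rng.random(activations.shape) >= p   # randomly drop roughly half of the activations
    return activations * mask / (1.0 - p)       # rescale so the expected sum is unchanged

print(dropout(np.ones(8)))   # roughly half zeros, surviving values scaled to 2.0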
o FC layers, also known as dense layers, are typically
placed at the end of a CNN
o They learn non-linear combinations of high-level feature
activations that have been extracted through a series
of convolutional and pooling layers
o FC layers are also used to map high-level feature
activations to the final output, making predictions or
classifications
o They combine and mix important information from all preceding convolutional layers
o The Softmax layer’s main duty is to perform multi-class classification
o It is typically placed at the end of a CNN as the last layer
o The softmax function (i.e., a probability distribution function) inspired the name of this layer
o through the softmax function, it produces probabilities of each class
o indicates the output class that the input is most likely to belong to
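Putting the pieces together, here is a minimal sketch of such a network in PyTorch (layer sizes are illustrative, not taken from the slides): convolution, ReLU, pooling, a fully connected layer, and a softmax producing class probabilities. In practice the softmax is often folded into the training loss rather than kept as an explicit layer.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # learnable 3x3 filters scan the image
    nn.ReLU(),                                  # non-linearity after the convolution
    nn.MaxPool2d(2),                            # downsample 32x32 to 16x16
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),                 # dense layer maps features to 10 classes
    nn.Softmax(dim=1),                          # probability of each class
)

x = torch.randn(1, 1, 32, 32)   # one fake 32x32 grayscale image
print(model(x).shape)           # torch.Size([1, 10]); the 10 probabilities sum to 1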
Thanks for your attention!
