
AI Foundations and Applications

8. Optimization of learning process

Thien Huynh-The
HCM City Univ. Technology and Education
Jan, 2023

Course Contents

1. Introduction (1 week)
2. Intelligent Agents (1 week)
3. Machine Learning: Fundamental Concepts (3 weeks)
1. Feature Engineering
2. Linear Regression
3. Overfitting
4. Machine Learning Algorithms (3 weeks)
1. Decision Tree
2. K-nearest neighbors
3. K-means clustering

5. Neural Networks (2 weeks)
6. Deep Learning (4 weeks)
1. Convolutional Neural Networks
2. Applications with codes

Note: This is the tentative content and may be changed depending on the teaching progress.
Note: There is no final exam for this course; one week is therefore reserved for marking.



Course Information

• The slides will be delivered to students:
• Check the website often (https://sites.google.com/site/thienhuynhthe/teaching)
• Printing a hard copy is recommended
• Assignments and announcements will be posted on the website
• Requirements:
• PC or laptop (in-class coding with Python or MATLAB)
• Coursework:
• Attendance: 10%
• In-class assignments: 30%
• Around 6 unannounced assignments (5% each)
• Students should prepare A4 paper (no other materials are allowed)
• Homework programming assignment: 20% (individual)
• Final programming project: 40% (2 students per group)
• Please note: I do not accept late assignment submissions for any reason.



Course Information (cont.)

• Homework assignment:
• Send the code to the teacher by email before the deadline
• Submissions after the deadline receive zero (no negotiation)
• Comment the code in detail (used to assess how well students understand their code)
• Run the code and record your laptop screen to demonstrate it (even if it produces errors)
• Final project:
• Send the code to the teacher by email before the deadline
• Submissions after the deadline receive zero (no negotiation)
• Comment the code in detail (used to assess how well students understand their code)
• Run the code directly on the student's laptop and show it to the teacher
• Answer questions from the teacher individually



Plagiarism

• All assignments are to be done individually.


• You can discuss the assignments with other students, but you should not give
your code (or parts of your code) to other students. You should not look at
another student’s code until after you have handed in your assignment (and
the due date is past).
• Plagiarism has occurred in the past in this class and it has had very negative
consequences for the students involved.
• Because 60% of the course mark is based on handed in work, I will be very
diligent about detecting plagiarism.
• Email policy:
• I do not answer questions about course content or queries about your marks by email.
• You do not need to email me when you want to be absent from class.



Challenges in Deep Learning

Local minima
• The objective function of deep learning usually has many local minima
• The numerical solution obtained by the final iteration may only minimize the
objective function locally, rather than globally.
• At these solutions the gradient of the objective function approaches or becomes zero, so gradient-based updates stall there and cannot tell a local minimum from the global one
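As a minimal illustration (not from the original slides), the sketch below runs plain gradient descent on a hand-picked 1D function with one local and one global minimum; which minimum it reaches depends entirely on the starting point.

```python
# Hypothetical 1D illustration: gradient descent converges to different minima
# depending on where it starts.
import numpy as np

def f(x):          # objective with one local and one global minimum
    return x**4 - 8 * x**2 + 3 * x

def grad(x):       # analytical derivative of f
    return 4 * x**3 - 16 * x + 3

def gradient_descent(x0, lr=0.01, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for start in (2.5, -2.5):
    x_final = gradient_descent(start)
    print(f"start={start:+.1f} -> x={x_final:+.3f}, f(x)={f(x_final):.3f}")
# Starting at +2.5 the run stalls in the shallower (local) minimum;
# starting at -2.5 it reaches the deeper (global) minimum.
```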



Challenges in Deep Learning
Vanishing gradient
• As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train
• The simplest solution is to use other activation functions, such as ReLU, which does not produce a small derivative for positive inputs
• Residual networks are another solution, as they provide residual (skip) connections straight to earlier layers
Exploding gradient
• On the contrary, in some cases, the gradients keep on
getting larger and larger as the backpropagation algorithm
progresses. This, in turn, causes very large weight updates
and causes the gradient descent to diverge. This is known
as the exploding gradients problem.
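A small numerical sketch (an assumed setup, not from the slides) of why both problems arise: the gradient reaching early layers is roughly a product of per-layer factors, so factors below 1 (such as sigmoid derivatives, which are at most 0.25) make it vanish, while factors above 1 make it explode.

```python
# Hedged numerical sketch of vanishing vs. exploding per-layer gradient factors.
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)           # at most 0.25, so deep products shrink quickly

depth = 30
z = 0.5                             # an arbitrary pre-activation value
vanishing = np.prod([sigmoid_deriv(z) for _ in range(depth)])
print(f"product of {depth} sigmoid derivatives: {vanishing:.3e}")   # ~1e-19, vanishes

# With a per-layer factor larger than 1 (e.g. from large weights), the product explodes.
exploding = np.prod([1.8 for _ in range(depth)])
print(f"product of {depth} factors of 1.8:      {exploding:.3e}")   # ~4e7, explodes
```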





Challenges in Deep Learning
How to choose the right Activation Function?
A few guidelines to help you choose:
• ReLU activation function should only be used in the hidden
layers.
• Sigmoid/Logistic and Tanh functions should not be used in
hidden layers as they make the model more susceptible to
problems during training (due to vanishing gradients).
• Swish function is used in neural networks having a depth
greater than 40 layers.
Choose the activation function for your output layer
based on the type of prediction problem that you are
solving:
• Regression - Linear Activation Function
• Binary Classification—Sigmoid/Logistic Activation Function
• Multiclass Classification—Softmax
• Multilabel Classification—Sigmoid
The activation function used in hidden layers is typically
chosen based on the type of neural network
architecture.
• Convolutional Neural Network (CNN): ReLU activation
function.
• Recurrent Neural Network: Tanh and/or Sigmoid activation
function.
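For reference, here is a hedged sketch of the activations mentioned above as plain NumPy functions (illustrative reimplementations, not the framework versions you would normally call), together with the usual output-layer pairing.

```python
# Minimal reference implementations of common activation functions.
import numpy as np

def relu(z):                       # hidden layers of CNNs / MLPs
    return np.maximum(0.0, z)

def sigmoid(z):                    # binary or multilabel classification outputs
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                       # common in recurrent networks
    return np.tanh(z)

def softmax(z):                    # multiclass classification outputs
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
print("softmax:", softmax(logits))   # sums to 1 -> mutually exclusive class probabilities
print("sigmoid:", sigmoid(logits))   # independent per-label probabilities
```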



Challenges in Deep Learning

Overfitting and underfitting


• Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to
a limited set of data points. As a result, the model is useful in reference only to its initial data
set, and not to any other data sets
• The model fits the training data well, but it does not perform well on the testing data
• Underfitting is a scenario in data science where a data model is unable to capture the
relationship between the input and output variables accurately, generating a high error rate on
both the training set and unseen data
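A small illustration with assumed synthetic data: fitting polynomials of different degrees to noisy samples of a cubic shows underfitting (degree too low, high error everywhere) and overfitting (degree too high, low training error but higher test error).

```python
# Illustrative under-/overfitting sketch on assumed synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, 20))
x_test = np.sort(rng.uniform(-3, 3, 20))
true_f = lambda x: x**3 - 2 * x
y_train = true_f(x_train) + rng.normal(0, 2, x_train.shape)
y_test = true_f(x_test) + rng.normal(0, 2, x_test.shape)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)          # fit polynomial of given degree
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse(x_train, y_train):7.2f}, "
          f"test MSE = {mse(x_test, y_test):7.2f}")
# Degree 1 typically underfits (both errors high); a very high degree typically
# overfits (training error near zero, test error larger); degree 3 balances both.
```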



Optimization schemes - Momentum

• The method of momentum is designed to accelerate learning.


• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction
[Figures: gradient descent trajectories with two variables, comparing learning rates 0.4 and 0.6.]



Optimization schemes - Momentum

• Instead of using only the gradient of the current step to guide the search,
momentum also accumulates the gradient of the past steps to determine the
direction to go.
• The equations of gradient descent are revised as follows.
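The formula image did not survive the export, so the standard (heavy-ball) momentum form is reproduced here as an assumption, with velocity v, momentum coefficient β, and learning rate η:

v_t = β·v_{t-1} + ∇f(x_{t-1})
x_t = x_{t-1} − η·v_t

A minimal NumPy sketch of this update (illustrative only):

```python
# Minimal sketch of gradient descent with momentum (standard heavy-ball form).
import numpy as np

def sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, steps=300):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                 # velocity: decaying average of past gradients
    for _ in range(steps):
        v = beta * v + grad_fn(x)        # accumulate past gradients
        x = x - lr * v                   # move in the accumulated direction
    return x

# Example: minimize f(x, y) = x^2 + 10*y^2, whose gradient is [2x, 20y]
grad_fn = lambda p: np.array([2 * p[0], 20 * p[1]])
print(sgd_momentum(grad_fn, x0=[5.0, 2.0]))   # approaches the minimum at (0, 0)
```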



Adaptive Gradient Descent (Adagrad)
• Decay the learning rate for parameters in proportion to their update history
• Adapts the learning rate to the parameters, performing smaller updates (low learning rates) for
parameters associated with frequently occurring features, and larger updates (high learning
rates) for parameters associated with infrequent features
• It is well-suited for dealing with sparse data
• Adagrad greatly improves the robustness of SGD and has been used for training large-scale neural nets
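A minimal NumPy sketch of the Adagrad update (the standard form is assumed here, since the slide's formula is not shown):

```python
# Minimal sketch of the Adagrad update rule.
import numpy as np

def adagrad(grad_fn, x0, lr=1.0, eps=1e-8, steps=300):
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                     # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        s += g ** 2                          # update history grows monotonically
        x -= lr * g / (np.sqrt(s) + eps)     # frequently updated parameters get smaller steps
    return x

grad_fn = lambda p: np.array([2 * p[0], 20 * p[1]])   # gradient of x^2 + 10*y^2
print(adagrad(grad_fn, x0=[5.0, 2.0]))   # moves toward (0, 0); steps shrink over time
```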



Root Mean Squared Propagation (RMSProp)

• Adapts the learning rate to the parameters


• Divide the learning rate for a weight by a running average of the magnitudes of
recent gradients for that weight
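A minimal NumPy sketch of the RMSProp update (standard form assumed; the slide's formula is not shown):

```python
# Minimal sketch of the RMSProp update rule.
import numpy as np

def rmsprop(grad_fn, x0, lr=0.05, rho=0.9, eps=1e-8, steps=300):
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                          # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        s = rho * s + (1 - rho) * g ** 2          # exponentially decaying, unlike Adagrad's sum
        x -= lr * g / (np.sqrt(s) + eps)          # divide the step by the gradient-magnitude estimate
    return x

grad_fn = lambda p: np.array([2 * p[0], 20 * p[1]])   # gradient of x^2 + 10*y^2
print(rmsprop(grad_fn, x0=[5.0, 2.0]))                # moves toward the minimum at (0, 0)
```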



Adaptive Moment Estimation (Adam)

• Adam combines two stochastic gradient descent approaches: Adaptive Gradients (Adagrad) and Root Mean Square Propagation (RMSProp)
• Adam also keeps an exponentially decaying average of past gradients similar
to SGD with momentum
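A minimal NumPy sketch of the Adam update with bias correction (standard form assumed; the slide's formula is not shown):

```python
# Minimal sketch of the Adam update rule with bias correction.
import numpy as np

def adam(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                      # decaying average of gradients (momentum part)
    v = np.zeros_like(x)                      # decaying average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

grad_fn = lambda p: np.array([2 * p[0], 20 * p[1]])   # gradient of x^2 + 10*y^2
print(adam(grad_fn, x0=[5.0, 2.0]))                   # moves toward the minimum at (0, 0)
```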



Dropout

• Avoid overfitting problem


• Probabilistically dropping out nodes in the network is a simple and effective
regularization method
• Dropout is implemented per-layer in a neural network
• A common value is a probability of 0.5 for retaining the output of each node in
a hidden layer
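A minimal sketch of (inverted) dropout applied to one hidden-layer activation, as an assumed illustration of what frameworks do internally:

```python
# Minimal sketch of inverted dropout on a hidden-layer activation.
import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                                   # dropout is disabled at test time
    mask = np.random.rand(*activations.shape) < keep_prob    # keep each node with probability keep_prob
    return activations * mask / keep_prob                    # rescale so the expected value is unchanged

h = np.ones((1, 8))                 # toy hidden-layer output
print(dropout(h, keep_prob=0.5))    # roughly half the nodes zeroed, survivors scaled to 2.0
```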





Assignment 2 (mandatory)

Design a multilayer neural network with an input layer, 02 hidden layers (sigmoid), and an output layer (softmax). Apply the Momentum and Adam optimization methods. Compare the accuracy and convergence time of the two methods. Assume that the MNIST dataset is used for training and testing the neural network.
Submit the Python code files by email (thienht@hcmute.edu.vn):
• Momentum.py
• Adam.py
• Deadline: 24/04/2023.
• Accounts for 10% of the overall subject score.

