
AI in Telecommunications

(TC-5005)
Lecture # 07-08
Course Teacher : Dr Danish Mahmood Khan
Designation: Assistant Professor
NED University of Engineering & Technology

Image sources:
https://www.wardsauto.com/vehicles/artificial-intelligence-doing-more-increase-driver-safety
https://onpassive.com/blog/what-is-the-impact-of-artificial-intelligence-on-the-telecom-industry/
https://engineering.fb.com/2016/12/01/ml-applications/artificial-intelligence-revealed/
Regression Revisited with Common Notation

$\hat{y} = c + mx \quad\Longleftrightarrow\quad h_\theta(x) = \theta_0 + \theta_1 x_1$

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

Squared error for a single training example: $\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Cost Function

In machine learning, a cost function, also known as a loss function or objective function, is a measure
of how well a model's predictions match the true values or labels of the training data. The goal of a
machine learning algorithm is often to minimize this cost function, as doing so implies that the model
is making accurate predictions.

The choice of a specific cost function depends on the type of problem being solved, such as
regression or classification.

 Mean Squared Error (MSE): Used in regression problems.


 Cross-Entropy Loss (Log Loss): Commonly used in classification problems.
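As a brief illustration (not from the slides), minimal NumPy versions of both losses; the array names y_true, y_pred and the example values are placeholders:

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy (log loss); clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example usage with made-up values
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(mse(y_true, y_pred))            # regression loss

labels = np.array([0, 1, 1])
probs  = np.array([0.2, 0.8, 0.6])
print(cross_entropy(labels, probs))   # classification loss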
Regression Cost Function
$$J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Data: $(x, y) = \{(1, 1), (2, 2), (3, 3)\}$

Let $\theta_0 = 0$, so $h_\theta(x) = 0 + \theta_1 x_1$. Let $\theta_1 = 1$:

$h_\theta(1) = 0 + 1 \cdot 1 = 1$
$h_\theta(2) = 0 + 1 \cdot 2 = 2$
$h_\theta(3) = 0 + 1 \cdot 3 = 3$

$$J(\theta_1 = 1) = \frac{1}{2(3)} \sum_{i=1}^{3} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{6}\left\{ (1-1)^2 + (2-2)^2 + (3-3)^2 \right\} = 0$$
Data: $(x, y) = \{(1, 1), (2, 2), (3, 3)\}$, with $h_\theta(x) = 0 + \theta_1 x_1$. Let $\theta_1 = 0.5$:

$h_\theta(1) = 0 + 0.5 \cdot 1 = 0.5$
$h_\theta(2) = 0 + 0.5 \cdot 2 = 1$
$h_\theta(3) = 0 + 0.5 \cdot 3 = 1.5$

$$J(\theta_1 = 0.5) = \frac{1}{6} \sum_{i=1}^{3} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{6}\left\{ (0.5-1)^2 + (1-2)^2 + (1.5-3)^2 \right\} = \frac{3.5}{6} \approx 0.583$$
Data: $(x, y) = \{(1, 1), (2, 2), (3, 3)\}$, with $h_\theta(x) = 0 + \theta_1 x_1$. Let $\theta_1 = 1.5$:

$h_\theta(1) = 0 + 1.5 \cdot 1 = 1.5$
$h_\theta(2) = 0 + 1.5 \cdot 2 = 3$
$h_\theta(3) = 0 + 1.5 \cdot 3 = 4.5$

$$J(\theta_1 = 1.5) = \frac{1}{6} \sum_{i=1}^{3} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{6}\left\{ (1.5-1)^2 + (3-2)^2 + (4.5-3)^2 \right\} = \frac{3.5}{6} \approx 0.583$$
Summary of the three trials:

$J(\theta_1 = 1) = 0$
$J(\theta_1 = 0.5) \approx 0.583$
$J(\theta_1 = 1.5) \approx 0.583$

[Figure: loss function curve $J(\theta_1)$ plotted against $\theta_1$; the curve is bowl-shaped with its minimum at $\theta_1 = 1$.]
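The three values above can be reproduced with a short NumPy sketch (array names are placeholders):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
n = len(x)

def J(theta1):
    # J(theta1) = (1/2n) * sum((theta1*x - y)^2), with theta0 fixed at 0
    return np.sum((theta1 * x - y) ** 2) / (2 * n)

for t in (1.0, 0.5, 1.5):
    print(t, round(J(t), 3))   # expected: 1.0 -> 0.0, 0.5 -> 0.583, 1.5 -> 0.583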
A global minimum (or global minimum point) refers
to the lowest possible value of the objective function
across the entire feasible space. The feasible space is
the set of all possible values for the parameters or
variables that satisfy any constraints imposed by the
problem.

We need an optimization algorithm to find this global minimum.

In mathematical terms, let's denote the objective function as $f(x)$, where $x$ represents the set of parameters or variables. A global minimum occurs at a point $x^*$ if:

$$f(x^*) \leq f(x)$$

for all possible $x$ within the feasible space.


Optimization Algorithms

In machine learning and data science, optimization algorithms are commonly used to find the optimal
set of parameters for a model that minimizes a certain cost or objective function.

The goal of optimization algorithms is to iteratively adjust the parameters of a model to reduce the
value of a cost function, indicating that the model's predictions are improving and becoming more
accurate. The process involves moving through the solution space in search of the minimum or
maximum of the objective function.

The process is considered converged when the changes in the objective function or parameters fall
below a certain threshold, indicating that further iterations are unlikely to significantly improve the
solution.
Common Optimization Algorithms
 Gradient Descent
 Stochastic Gradient Descent (SGD)
 Mini-Batch Gradient Descent
 Momentum
 Adagrad (Adaptive Gradient Algorithm)
 RMSprop (Root Mean Square Propagation)
 Adam (Adaptive Moment Estimation)
 Nadam (Nesterov-accelerated Adaptive Moment Estimation)

Choosing the right optimization method depends on factors such as the nature of the data, the
characteristics of the loss landscape, and the computational resources available. Experimentation and
tuning are often necessary to find the most effective optimization strategy for a specific machine
learning task.
Gradient Descent
Gradient Descent is an iterative optimization algorithm used for minimizing the cost or loss function in machine learning models. The primary goal of this algorithm is to find the minimum of a function, typically the loss function, by iteratively moving in the direction of the steepest decrease of the function.

[Figure: loss curve $J(\theta_1)$ versus $\theta_1$, with the slope at the current value of $\theta_1$ marked.]

 Initialize $\theta_1$ randomly.
 Calculate the derivative in order to find the slope. This will help in updating the value of $\theta_1$.
 This slope could be positive or negative.
 Find the updated $\theta_1$ using the convergence theorem (update rule):

$$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_1)$$

where $\alpha$ is the learning rate and $\frac{\partial}{\partial \theta_j} J(\theta_1)$ is the slope.
Finding Global Minima

 Repeat the convergence theorem (update rule) until convergence:

$$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_1)$$

[Figure: loss curve $J(\theta_1)$ versus $\theta_1$ showing successive parameter updates moving toward the global minimum at $\theta_1 = 1$.]
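A minimal Python/NumPy sketch of this update rule for the one-parameter example above (with theta0 fixed at 0); the initialization, learning rate, and iteration count are illustrative choices, not values from the slides:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
n = len(x)

theta1 = 0.0          # arbitrary initialization
alpha  = 0.1          # learning rate (illustrative value)

for _ in range(100):
    # dJ/dtheta1 = (1/n) * sum((h(x) - y) * x) for J = (1/2n) * sum((h(x) - y)^2)
    grad = np.sum((theta1 * x - y) * x) / n
    theta1 = theta1 - alpha * grad     # convergence theorem / update rule

print(theta1)   # converges toward the global minimum theta1 = 1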
Learning Rate
The learning rate is a hyperparameter in machine learning optimization algorithms that determines the
size of the steps taken during the optimization process. It is a crucial parameter because it influences
how quickly or slowly a machine learning model learns.

[Figure: effect of a small learning rate versus a large learning rate on the gradient descent steps.]

Loss Function Curve for $J(\theta_0, \theta_1)$

[Figure: three-dimensional bowl-shaped surface of $J(\theta_0, \theta_1)$ over the $(\theta_0, \theta_1)$ plane.]
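To make the learning-rate trade-off concrete, here is a tiny sketch on the simple quadratic J(theta) = theta^2; all values are illustrative, not from the slides:

def run_gd(alpha, steps=20, theta=2.0):
    # gradient of J(theta) = theta^2 is 2*theta
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run_gd(0.01))   # small alpha: still far from 0 after 20 steps (slow learning)
print(run_gd(0.4))    # moderate alpha: essentially at the minimum (0)
print(run_gd(1.1))    # too large: updates overshoot and the value diverges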
Overfitting & Underfitting
Underfitting and overfitting are two common issues in machine learning that relate to how well a model generalizes to new, unseen data. These issues arise during training, when a balance must be found between simplicity and complexity in the model.
Underfitting
Underfitting occurs when a model is too simple to capture the underlying patterns in
the training data. As a result, it performs poorly not only on the training data but also
on new, unseen data.
Identification
 High training error.
 High testing error.
 Fails to capture the complexities and patterns in the data.

Causes
 Using too simple a model (e.g., linear regression on a highly non-linear problem).
 Insufficient training (too few iterations or too little data).
 Lack of relevant features.

Possible Solution
 Increase model complexity (e.g., use a more complex algorithm or increase the degree of a polynomial).
 Add more relevant features.
 Train for more epochs or with more data.
Overfitting
Overfitting occurs when a model is too complex and fits the training data too closely, capturing
noise and random fluctuations. While it may perform well on the training data, it fails to generalize
to new data.
Identification
 Low training error.
 High testing error.
 Model memorizes the training data instead of learning the underlying patterns.

Causes
 Too complex a model (e.g., using too many features or high-degree polynomials).
 Training for too many epochs.
 Limited amount of training data.

Possible Solution
 Reduce model complexity (e.g., use fewer features, decrease polynomial degree).
 Regularization techniques (e.g., L1 or L2 regularization).
 Increase the amount of training data.
 Early stopping during training.
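A hedged scikit-learn sketch that illustrates both issues, using the polynomial degree as the complexity knob; the dataset and the degrees tried are made up for illustration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 60)   # noisy non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 3),   # training error
          round(mean_squared_error(y_te, model.predict(X_te)), 3))   # testing error

Typically the lowest-degree model shows high error on both sets (underfitting), while the highest-degree model shows low training error but a higher testing error (overfitting).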
Ridge Regression (L2 Regularization)
Ridge Regression, also known as Tikhonov regularization or L2 regularization, is a linear
regression technique that introduces a regularization term to the cost function. It is designed to
address the issue of multicollinearity (high correlation among predictor variables) and prevent
overfitting in linear regression models.
$$J(\theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_{\theta_1}(x^{(i)}) - y^{(i)} \right)^2 + \lambda (\theta_1)^2$$

In general:

$$J(\theta) = \text{Loss Function} + \lambda \sum_{i=1}^{n} \theta_i^2$$

 𝜆𝜆 (lambda) is the regularization parameter, a non-negative hyperparameter that controls the strength
of the regularization.
 Ridge Regression is effective when dealing with multicollinearity, where predictor variables are
highly correlated.
 It helps prevent overfitting by penalizing large coefficients.
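A minimal scikit-learn sketch of Ridge Regression; scikit-learn's alpha parameter plays the role of λ above, and the small dataset with two correlated features is invented for illustration:

import numpy as np
from sklearn.linear_model import Ridge

# made-up dataset: the second feature is roughly twice the first (multicollinearity)
X = np.array([[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.9]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

ridge = Ridge(alpha=1.0)               # alpha is the regularization strength (lambda)
ridge.fit(X, y)
print(ridge.coef_, ridge.intercept_)   # coefficients are shrunk, but not set to zero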
Lasso Regression (L1 Regularization)
Lasso Regression, or L1 regularization, is a linear regression technique that adds a regularization
term to the cost function to prevent overfitting and encourage the model to use fewer features. The
term "Lasso" stands for Least Absolute Shrinkage and Selection Operator. Lasso Regression is
particularly useful when dealing with datasets where many features may not contribute
significantly to the prediction.

$$J(\theta) = \text{Loss Function} + \lambda \sum_{i=1}^{n} |\theta_i|$$

 Lasso tends to shrink some coefficients to zero, effectively removing them from the model.
 The remaining non-zero coefficients are believed to be the most important predictors.
 This property makes Lasso useful for feature selection, as it can automatically eliminate less
important features.
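A minimal scikit-learn sketch of Lasso; the synthetic data (only two informative features out of five) is invented to show coefficients being driven to zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))          # 5 features, only the first 2 matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1)               # alpha is the regularization strength (lambda)
lasso.fit(X, y)
print(lasso.coef_)                     # coefficients of irrelevant features shrink to ~0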
Elastic Net
Elastic Net is a linear regression technique that combines both L1 regularization (Lasso) and L2
regularization (Ridge) in its objective function. It is designed to address some of the limitations of
Lasso and Ridge Regression and provides a balance between feature selection and coefficient
shrinkage. Elastic Net introduces two hyperparameters, λ and α, where λ controls the overall
strength of the regularization, and α determines the mix between L1 and L2 regularization.

$$J(\theta) = \text{Loss Function} + \lambda \left( \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - \alpha}{2} \sum_{i=1}^{n} \theta_i^2 \right)$$

 Elastic Net combines the sparsity-inducing property of Lasso (L1) with the Ridge (L2) penalty for
stability.
 The hyperparameter α allows for adjusting the mix between L1 and L2 regularization, providing
flexibility in addressing specific modeling needs.
 Similar to Lasso, Elastic Net tends to shrink some coefficients to exactly zero, effectively performing
feature selection.
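A minimal scikit-learn sketch of Elastic Net; in scikit-learn, alpha corresponds to the overall strength λ and l1_ratio to the mixing parameter α above (the data and values are illustrative):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

# alpha ~ overall regularization strength, l1_ratio ~ mix between L1 and L2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)   # some coefficients shrink toward or to zero, others are kept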
Grid Search
GridSearch is a hyperparameter tuning technique used in machine learning to find the best set of
hyperparameters for a model. Hyperparameters are configuration settings for a model that are not
learned from the data but must be specified before training. Examples include learning rate, the number
of hidden layers in a neural network, or the depth of a decision tree.

The process of hyperparameter tuning involves searching through a predefined set of hyperparameter
values to find the combination that results in the best model performance. GridSearch is a systematic
and exhaustive search strategy where you specify a grid of hyperparameter values, and the algorithm
evaluates the performance of the model for each combination of these values.
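A hedged sketch of this idea using scikit-learn's GridSearchCV, tuning the Elastic Net hyperparameters from the previous section over an invented grid:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],      # regularization strength
    "l1_ratio": [0.2, 0.5, 0.8],          # L1/L2 mix
}

# exhaustively evaluates every combination with 5-fold cross-validation
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)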
LR as Classifier

[Figure: binary (0/1) labels plotted against $x$, with a linear regression line and a 0.5 threshold on the predicted value.]
LR as Classifier

[Figure: linear regression fit on binary (0/1) labels, illustrating the issues listed below.]

 Outlier issue.
 Estimated value could be negative as well as greater than 1.
LR as Classifier

 Need squashing:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1)$$

[Figure: S-shaped sigmoid curve squashing the linear output into the range (0, 1), with the 0.5 level marked.]

 We use the sigmoid function as $g$ here, with $z = \theta^T x = \theta_0 + \theta_1 x$:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Logistic Regression Equation:

$$h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1)}}$$
The sigmoid function, also known as the logistic function, is a mathematical function commonly
used in machine learning and statistics. It's a type of activation function that maps any real-valued
number to a value between 0 and 1.

The sigmoid function has an S-shaped curve, which is why it is often called a sigmoid curve. This
curve is useful in binary classification problems, where the goal is to classify an input into one of
two categories. The output of the sigmoid function can be interpreted as a probability, with values
closer to 1 indicating a higher probability of belonging to one category, and values closer to 0
indicating a higher probability of belonging to the other category.
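A one-function sketch of the sigmoid in NumPy (example inputs are arbitrary):

import numpy as np

def sigmoid(z):
    # maps any real number to a value in the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-5), sigmoid(0), sigmoid(5))   # ~0.007, 0.5, ~0.993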
Cost Function
If we plug the sigmoid hypothesis into the squared-error cost used for linear regression,

$$J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1)}},$$

the resulting cost function is no longer convex.

[Figure: a convex (bowl-shaped) cost curve compared with a non-convex cost curve that has multiple local minima.]
Non Convex Cost Function

[Figure: non-convex cost curve with three marked points.]
 P1 = ?
 P2 = ?
 P3 = ?
Issues

 Local Minima and Saddle Points: Non-convex functions can have multiple local minima, making it
challenging for optimization algorithms to find the global minimum. Additionally, saddle points, where
the gradient is zero, but the point is not an optimum, can mislead optimization algorithms.

 Stuck in Local Optima: Optimization algorithms may get stuck in a local minimum, failing to reach
the global minimum. This can result in suboptimal model parameters and reduced model performance.

 Sensitive to Initialization: Non-convex loss functions can be sensitive to the initial values of the
parameters. Different initializations may lead to different local minima, and finding a good starting
point becomes crucial.

 Limited Theoretical Guarantees: Convex optimization problems have well-established theoretical guarantees, such as global optimality and convergence. Non-convex optimization lacks such guarantees, making it more challenging to reason about the behavior of optimization algorithms.

 Computational Complexity: Optimizing non-convex functions is computationally more expensive than optimizing convex ones. Non-convex optimization often requires more sophisticated optimization techniques and longer training times.
Logistic Regression Cost Function (Log Loss)

$$J(\theta_0, \theta_1) = -\frac{1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
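A short NumPy sketch of this cost, keeping the slide's 1/(2n) scaling; the clipping constant is a standard numerical safeguard and the example values are invented:

import numpy as np

def logistic_cost(theta0, theta1, x, y, eps=1e-12):
    h = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * x)))   # sigmoid hypothesis
    h = np.clip(h, eps, 1 - eps)                       # avoid log(0)
    n = len(x)
    # cross-entropy cost with the slide's 1/(2n) scaling
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / (2 * n)

# illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(logistic_cost(theta0=-5.0, theta1=2.0, x=x, y=y))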
In the field of telecommunications, the efficient management of Radio Access Networks (RANs) is vital to ensure optimal service quality. One critical challenge is predicting high-load scenarios in base stations, as an unexpected surge in user activity can lead to network congestion and degraded performance. Consider a scenario in the management of a RAN where we want to predict the likelihood of a cell site experiencing high traffic congestion based on various independent variables. In this case, the independent variables could include factors such as the number of connected devices, the bandwidth of the channel, and the signal strength.
No. of Connected Devices (X1) | Bandwidth (X2) | Signal Strength (X3) | High Congestion (Y)
50 | 20 | -75 | 0
75 | 15 | -80 | 1
60 | 25 | -70 | 0
90 | 18 | -85 | 1
80 | 22 | -78 | 1

If the intercept is -3, while the regression coefficients for the number of connected devices, bandwidth, and signal strength are 0.1, 0.05, and -0.02 respectively, find the likelihood of the RAN suffering from high congestion when:

No. of Connected Devices (X1) | Bandwidth (X2) | Signal Strength (X3)
60 | 18 | -78
70 | 22 | -82
80 | 22 | -78
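A short sketch of the requested calculation, applying the logistic regression equation with the given intercept and coefficients; the probability shown in the final comment is approximate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# intercept and coefficients given in the problem statement
b0, b1, b2, b3 = -3.0, 0.1, 0.05, -0.02

cases = [(60, 18, -78), (70, 22, -82), (80, 22, -78)]
for x1, x2, x3 in cases:
    z = b0 + b1 * x1 + b2 * x2 + b3 * x3
    print((x1, x2, x3), round(sigmoid(z), 4))
# e.g. for (60, 18, -78): z = -3 + 6 + 0.9 + 1.56 = 5.46, so P(Y = 1) ≈ 0.996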
