
Machine Learning

Support Vector Machine, Kernels


Ramon Baldrich
Universitat Autònoma de Barcelona

Credit to: Dimos Karatzas, A. Ng, A. Karpathy, T. Mitchell


Another ML introduction:
A pictorial introduction

SUPPORT VECTOR MACHINES


Classification

Generative models: try to model each class, e.g. estimate the PDF that gave rise to these points (x: data, y: label).
We learn: p(x | y)

Discriminative models: use the data to find the best way to separate the classes.
We learn: p(y | x)

(Figure: two classes plotted in the x1–x2 feature space.)
Linearly Separable Classes

Decision boundary: θᵀx = 0
θᵀx > 0 on one side of the boundary, θᵀx < 0 on the other.

θ = [θ0, θ1, ..., θn]ᵀ
x = [x0 = 1, x1, ..., xn]ᵀ
Linearly Separable Classes

There are multiple valid solutions (hyperplanes that separate the two classes).
Which one shall we choose?
Support Vector Machines

Linear classifiers: the decision boundary is a hyperplane.

Maximal margin: the solution maximises the margin (empty space) between the classes.
We look for the hyperplane with the largest possible separation between the two classes.

Support vectors: only a few data points are needed to define this margin. They are called support vectors.
Support Vector Machines

Every other decision boundary has a smaller margin.

If the support vectors move, the decision boundary changes.

If the rest of the points move, the decision boundary remains stable.
Outliers - Soft Margin

The SVM is relatively robust to outliers, but outliers (if they happen to be support vectors) might have a big effect on the solution.

In this case, we can relax the maximum margin condition (soft margin).

This implies adding a tolerance to errors, controlled by what are called slack variables: we soften the constraint by adding a tolerance.
Non-Linearly Separable Classes – The Kernel Trick

What if my feature space is not linear? We transform the space into another one where the classes can be separated.

If classes are not linearly separable, SVM will not work.

The solution is to transform the feature space into another one (of higher dimension), in which our classes are linearly separable.
(Figure: a 1D feature space mapped into a 2D space where the classes become separable.)

It is not necessary to define this transformation explicitly. It is only necessary to define a similarity function that calculates the dot product in this new space. This is called a kernel.
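As a minimal illustration of this idea (not part of the original slides), the sketch below maps a 1D feature x to (x, x²); a class that sits between the two halves of the other class in 1D becomes linearly separable in the 2D space. The data and the threshold are invented for the example.

```python
import numpy as np

# Hypothetical 1D data: class -1 sits in the middle, class +1 on both sides,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.2, 3.1])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Explicit mapping phi(x) = (x, x^2): in this 2D space the classes can be
# separated by the horizontal line x^2 = 1.5, i.e. a hyperplane.
phi = np.column_stack([x, x ** 2])

separable = np.all((phi[:, 1] > 1.5) == (y == 1))
print(separable)  # True: a linear boundary exists in the transformed space
```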
INTUITION AND OPTIMISATION OBJECTIVE
A Line – Just in Case

A hyperplane is defined by a set of parameters θ, θ0:

    θᵀx + θ0 = 0

In the rest of the discussion we will assume that θ0 is equal to zero for simplicity; this means that the hyperplane crosses the origin.
Intuition

We want to find the hyperplane that maximises the distance between the closest vectors (of both classes) to the hyperplane and the hyperplane itself.

The hyperplane is defined by the vector θ.
d+ : distance from the hyperplane to the support vectors of class (+)
d- : distance from the hyperplane to the support vectors of class (-)

    d+ = d- = D
A bit of Geometry

Given a hyperplane θ, the distance of a point x to the hyperplane is given by:

    D = |x| cos(α)

The definition of the dot product between two vectors x and θ is:

    θ · x = θᵀx = |θ| |x| cos(α)

So the distance can be written:

    D = θᵀx / |θ| = |x| cos(α)
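A quick numerical check of the distance formula (a sketch with made-up numbers, assuming θ0 = 0 as in the slides):

```python
import numpy as np

theta = np.array([3.0, 4.0])   # hyperplane through the origin: theta^T x = 0
x = np.array([2.0, 1.0])       # an arbitrary point

# D = theta^T x / |theta|  (signed distance; the sign tells us the side)
D = theta @ x / np.linalg.norm(theta)
print(D)  # 2.0 = |x| * cos(alpha)
```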
A bit of Geometry

We can define the same hyperplane in many ways: a whole family of vectors (same direction, different lengths) defines it, and the distance calculation would be the same:

    D = θᵀx / |θ| = (2θ)ᵀx / |2θ| = (400θ)ᵀx / |400θ|
A bit of Geometry

When searching for the best hyperplane, we want to pick a single representative among all equivalent representations of a given hyperplane (the family of vectors defining the same hyperplane).

The convention for SVM is to pick the vector θ that makes (we introduce θ0 back):

    θᵀx + θ0 = +1 for all support vectors of class (+)
    θᵀx + θ0 = −1 for all support vectors of class (−)

Therefore the distance we are trying to maximise is given by:

    D = |θᵀx + θ0| / |θ| = 1 / |θ|

Maximising D = minimising |θ|.
Optimisation Function

Maximise the margin ⟺ minimise |θ|:

    min |θ|  ⟺  min ½ |θ|²

Subject to (classify correctly):

    yᵢ(θᵀxᵢ + θ0) ≥ 1

This is a quadratic optimisation problem.
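A small sketch of this optimisation in practice, using scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin problem (the toy data is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes (invented data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Very large C ~ hard margin: (almost) no constraint violations are tolerated
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)   # theta and theta_0 of the separating hyperplane
print(clf.support_vectors_)        # only a few points define the margin
```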
Cost Functions

This means setting the costs so that we do not penalise (cost = zero) anything classified correctly, but we do penalise if things fall on the wrong side:

    Cost for y = 1: zero when θᵀx + θ0 ≥ 1
    Cost for y = −1: zero when θᵀx + θ0 ≤ −1
COMPARISON TO LOGISTIC REGRESSION
Logistic Regression Reminder

    hθ(x) = g(z) = 1 / (1 + e^(−θᵀx)),   z = θᵀx

If y = 1, we want hθ ≈ 1, i.e. θᵀx >> 0
If y = 0, we want hθ ≈ 0, i.e. θᵀx << 0
Simplified cost function

    J(θ) = (1/m) Σᵢ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

    Cost(hθ(x), y) = −log(hθ(x))       if y = 1
                     −log(1 − hθ(x))   if y = 0

(Note that y is always either 0 or 1.)

    Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

    If y = 1: Cost(hθ(x), y) = −log(hθ(x))
    If y = 0: Cost(hθ(x), y) = −log(1 − hθ(x))
The Cost of a Sample

    Cost of sample = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
                   = −y log(1 / (1 + e^(−θᵀx))) − (1 − y) log(1 − 1 / (1 + e^(−θᵀx)))

If y = 1, we want θᵀx >> 0: only the term −log(1 / (1 + e^(−θᵀx))) remains, and it decreases towards zero as z = θᵀx grows.
If y = 0, we want θᵀx << 0: only the term −log(1 − 1 / (1 + e^(−θᵀx))) remains, and it decreases towards zero as z = θᵀx becomes more negative.
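A direct transcription of the per-sample logistic cost into code (a sketch; the parameter and sample values are arbitrary):

```python
import numpy as np

def logistic_cost(theta, x, y):
    """Per-sample cost: -y*log(h) - (1-y)*log(1-h), with h = sigmoid(theta^T x)."""
    h = 1.0 / (1.0 + np.exp(-theta @ x))
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

theta = np.array([2.0, -1.0])
print(logistic_cost(theta, np.array([3.0, 1.0]), 1))  # theta^T x = 5  -> cost close to 0
print(logistic_cost(theta, np.array([3.0, 1.0]), 0))  # same z, wrong label -> large cost
```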
Towards SVM

    Cost of sample = y · cost₁(θᵀx) + (1 − y) · cost₀(θᵀx)

cost₁ and cost₀ are piecewise-linear approximations of the logistic costs −log(1 / (1 + e^(−θᵀx))) and −log(1 − 1 / (1 + e^(−θᵀx))):

If y = 1, we want θᵀx ≥ 1: cost₁(θᵀx) is zero there and grows linearly for θᵀx < 1.
If y = 0, we want θᵀx ≤ −1: cost₀(θᵀx) is zero there and grows linearly for θᵀx > −1.
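The piecewise-linear surrogates cost₁ and cost₀ can be written directly as hinge functions (a sketch; the exact slope of the surrogate is not important as long as the flat regions match the plots above):

```python
import numpy as np

def cost1(z):
    """Cost used when y = 1: zero for z >= 1, linear penalty below (hinge)."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost used when y = 0: zero for z <= -1, linear penalty above (hinge)."""
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(cost1(z))  # [3. 2. 1. 0. 0.]
print(cost0(z))  # [0. 0. 1. 2. 3.]
```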
Optimization Objective

Logistic Regression:

    min over θ:  (1/m) Σᵢ [ y⁽ⁱ⁾ (−log hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) (−log(1 − hθ(x⁽ⁱ⁾))) ] + (λ/2m) Σⱼ θⱼ²

SVM:

    min over θ:  (1/m) Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + (λ/2m) Σⱼ θⱼ²

By convention: we do not divide by m, and we use a weight C (equivalent to 1/λ) to control the relative weight of the regularization factor:

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²
SVM Objective Function

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

Unlike logistic regression, the SVM output cannot be treated as a probability:

    hθ(x) = sign(θᵀx)
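Putting the pieces together, the full objective can be evaluated in a few lines (a sketch assuming the hinge form of cost₁/cost₀ shown earlier; X, y and theta are invented values):

```python
import numpy as np

def svm_objective(theta, X, y, C):
    """C * sum_i [ y_i*cost1(theta^T x_i) + (1-y_i)*cost0(theta^T x_i) ] + 0.5*sum_j theta_j^2,
    with y in {0, 1} and hinge-shaped cost1/cost0."""
    z = X @ theta
    cost1 = np.maximum(0.0, 1.0 - z)
    cost0 = np.maximum(0.0, 1.0 + z)
    return C * np.sum(y * cost1 + (1 - y) * cost0) + 0.5 * np.sum(theta ** 2)

def predict(theta, X):
    return np.sign(X @ theta)          # the output is a label, not a probability

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5]])
y = np.array([1, 1, 0])
theta = np.array([0.5, 0.5])
print(svm_objective(theta, X, y, C=1.0), predict(theta, X))
```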
Large Margin Classifier

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

If y = 1, we want θᵀx ≥ 1 (not just θᵀx ≥ 0)
If y = 0, we want θᵀx ≤ −1 (not just θᵀx < 0)
Large Margin Classifier

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

Suppose that we are exploring the set of (many possible) decision boundaries that correctly classify our data (C is set to a very large value):

    If y = 1, θᵀx ≥ 1
    If y = 0, θᵀx ≤ −1

The first term of the objective function is then zero, so minimizing the second term has the effect of selecting the solution with the maximum margin!
DUAL REPRESENTATION AND KERNELS
Optimisation Function

Maximise the margin ⟺ minimise:

    ½ |θ|² = ½ θᵀθ

Subject to (classify correctly):

    yᵢ(θᵀxᵢ + θ0) ≥ 1

This is a quadratic optimisation problem.
Dual Representation

OBJECTIVE: obtain the θ, θ0 defining the solution hyperplane θᵀxᵢ + θ0 = 0.
This is solved as a problem of quadratic optimisation (SVM PRIMAL):

    min over θ, θ0:  Φ(θ) = ½ θᵀθ
    subject to:      yᵢ(θᵀxᵢ + θ0) ≥ 1

    ⟹  max over α of ( min over θ, θ0 of L(θ, θ0, α) )

Construction of the Lagrangian L: for a problem "minimise f(x) subject to constraints gᵢ(x) ≥ 0" we build

    L(x, α) = f(x) − Σᵢ αᵢ gᵢ(x),   αᵢ ≥ 0   (Lagrange multipliers)

with
    f(x)  → Φ(θ)
    gᵢ(x) → yᵢ(θᵀxᵢ + θ0) − 1 ≥ 0

The Lagrangian L of the SVM problem is therefore:

    L(θ, θ0, α) = ½ θᵀθ − Σᵢ αᵢ [ yᵢ(θᵀxᵢ + θ0) − 1 ]

The minimisation of L with respect to θ, θ0 implies:

    ∂L/∂θ  = θ − Σᵢ αᵢ yᵢ xᵢ = 0   ⟹  θ = Σᵢ αᵢ yᵢ xᵢ
    ∂L/∂θ0 = −Σᵢ αᵢ yᵢ = 0        ⟹  Σᵢ αᵢ yᵢ = 0,   αᵢ ≥ 0

Substituting back into L:

    L(θ, θ0, α) = ½ (Σᵢ αᵢ yᵢ xᵢ)ᵀ (Σⱼ αⱼ yⱼ xⱼ) − Σᵢ αᵢ yᵢ Σⱼ αⱼ yⱼ xⱼᵀxᵢ + Σᵢ αᵢ
                = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

This gives the SVM DUAL problem:

    max over α:  Φ(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ
    subject to:  Σᵢ αᵢ yᵢ = 0,   αᵢ ≥ 0

From the dual solution we recover the primal parameters and the SVM classifier:

    θ  = Σᵢ αᵢ yᵢ xᵢ
    θ0 = yᵢ − θᵀxᵢ   (for any support vector xᵢ)

    ⟹  f(x) = sgn( Σᵢ αᵢ yᵢ xᵢᵀx + θ0 )       SVM Classifier
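The relations above can be checked numerically: scikit-learn's SVC exposes the dual solution, and for a linear kernel θ = Σᵢ αᵢyᵢxᵢ can be reconstructed from it (a sketch with invented data; dual_coef_ stores the products αᵢyᵢ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# theta = sum_i alpha_i * y_i * x_i, using only the support vectors
theta = clf.dual_coef_ @ clf.support_vectors_   # shape (1, n_features)
print(np.allclose(theta, clf.coef_))            # True: matches the primal weights

# f(x) = sgn(theta^T x + theta_0)
x_new = np.array([[3.0, 3.0]])
print(np.sign(x_new @ theta.ravel() + clf.intercept_))
```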
Non-Linearly Separable Classes – The Kernel Trick

1. Transform the feature space into another (higher-dimensional) one.
2. Define a function that computes the dot product in this new space.
3. Our classes are now linearly separable (in the new space).

Mapping function:

    x → φ(x)

Kernel function:

    K(x1, x2) = φ(x1)ᵀ φ(x2)

(Figure: a 1D feature x mapped by φ into a 2D space.)
The Kernel Trick

Dual Form
Maximise:

    L̃(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

Subject to:   αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

Kernel SVM
Maximise:

    L̃(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ) + Σᵢ αᵢ

Subject to:   αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

    θ0 = yᵢ − θᵀφ(xᵢ) = yᵢ − Σⱼ yⱼ αⱼ K(xⱼ, xᵢ)

    hθ(x) = sign( Σᵢ yᵢ αᵢ K(x, xᵢ) + θ0 )
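The kernelised classifier h(x) = sign(Σᵢ yᵢαᵢ K(x, xᵢ) + θ0) can be reproduced from a fitted RBF SVM (a sketch; the data is invented, and scikit-learn's gamma plays the role of 1/2σ²):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)   # ring-shaped, not linearly separable

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

def rbf(a, b, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2), evaluated row-wise
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.2, 0.1])
# h(x) = sign( sum_i (alpha_i * y_i) * K(x, x_i) + theta_0 )
decision = clf.dual_coef_ @ rbf(clf.support_vectors_, x_new) + clf.intercept_
print(np.sign(decision), clf.predict(x_new[None, :]))   # the two should agree
```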
Example Kernel Functions

Kernel functions introduce their own parameters, the values of which are generally fixed through cross-validation.

    Polynomial:             K(x, z) = (xᵀz)^d

    Radial Basis Function:  K(x, z) = exp(−‖x − z‖² / 2σ²)

    Sigmoid:                K(x, z) = tanh(κ⟨x, z⟩ − δ)

    Inverse Multiquadric:   K(x, z) = (‖x − z‖² + c²)^(−1/2)

    Intersection Kernel:    K(x, z) = Σᵢ₌₁ⁿ min(x(i), z(i))

Kernel functions must be continuous and symmetric, and should preferably have a positive (semi-)definite Gram matrix. Kernels that satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have only non-negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem is convex and its solution is unique.

http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/
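A few of these kernels written out directly (a sketch; the parameter names d, sigma, kappa, delta follow the formulas above):

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    return (x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ z) - delta)

def intersection_kernel(x, z):
    return np.sum(np.minimum(x, z))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, z), rbf_kernel(x, z), intersection_kernel(x, z))
```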
SOFT MARGIN, SLACK VARIABLES
Outliers - Soft Margin

Outliers (if they happen to be support vectors) might have a big effect on the solution.

In this case, we can relax the maximum margin condition (soft margin).

This implies adding a tolerance to errors, controlled by what are called slack variables.
Outliers - Soft Margin

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

    C = 1/λ

With C very large, the objective behaves like the hard margin and the boundary is pulled by the outlier; with C not too large, the outlier is tolerated and the margin stays wide.
Outliers - Soft Margin
Primal Form

Minimise:

    ½ θᵀθ + C Σᵢ ξᵢ

Subject to:

    yᵢ(θᵀxᵢ + θ0) ≥ 1 − ξᵢ,   ξᵢ ≥ 0

The impact of the slack variables ξᵢ is modulated in the optimization problem by a parameter C, which can be interpreted as a regularization parameter (maximization of the margin vs. minimization of the accepted error).
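The role of C can be seen directly with scikit-learn by fitting the same data, containing one outlier, with a small and a large C (invented data; a sketch, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5],
              [3.8, 3.8]])                 # an outlier of class -1 inside the +1 region
y = np.array([-1, -1, -1, 1, 1, 1, -1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, the outlier's slack is accepted (it may be misclassified).
    # Large C: slack is expensive, so the boundary is pulled toward the outlier
    # and the margin shrinks.
    print(C, clf.coef_, clf.n_support_)
```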
Outliers - Soft Margin
Dual Form

Maximise:

    L̃(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

Subject to:

    Σᵢ αᵢ yᵢ = 0,   C ≥ αᵢ ≥ 0

In the dual representation, slack variables imply a new inequality (the upper bound C on αᵢ) that regulates the contribution of the vectors.
Support Vector Machines (SVM)
Key concepts:

• SVM as a quadratic optimisation problem.

• The dual problem allows the solution to be expressed explicitly in terms of scalar products.

• The solution expressed in terms of scalar products allows the introduction of kernels.

• Kernel functions make it possible to classify non-linearly separable datasets.

• The slack variables introduce a regularisation factor that allows tolerance to errors by relaxing the condition of maximal margin.

• Both the kernel parameters and the regularisation factor must be adjusted during the training stage (a cross-validation sketch follows below).

Note: more penalty → underfitting; less penalty → overfitting. Since C = 1/regularisation, a larger C → overfitting and a smaller C → underfitting.
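A hedged sketch of how this adjustment is typically done in practice with scikit-learn (the dataset and parameter grid are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.2, 1, -1)   # non-linearly separable toy data

# Cross-validate both the regularisation factor C and the kernel parameter gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```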
