
UNIVERSITY OF SOUTHAMPTON

MATHEMATICS

MATH3082 Optimization
2021-22

Lecturers
Professor Hou-Duo Qi (for Linear Programming), Email: H.Qi@soton.ac.uk
Dr Vuong Phan (for Nonlinear Programming), Email: T.V.Phan@soton.ac.uk

Lectures
See your timetable

Computer Laboratory (to be confirmed)


Assessment
15% Coursework
85% Written Examination

Content
Ch.1: Introduction to Optimization

Ch.2: Simplex Methods for Linear Programming

Ch.3: Duality Theory

Ch.4: Nonlinear Optimization

References
[1] R. Vanderbei, Linear Programming: Foundations and Extensions (an excellent online book).

[2] J. Nocedal and S. Wright, Numerical Optimization (1st and 2nd ed. in Hartley Library).

[3] S. Boyd and L. Vandenberghe, Convex Optimization (online book).


[4] M.C. Ferris, O.L. Mangasarian, and S.J. Wright, Linear Programming with MATLAB, MPS-SIAM Series on Optimization, 2007.

1 Introduction to Optimization
Nothing happens in the universe that does not have a sense of either certain maxi-
mum or minimum. L. Euler, Swiss Mathematician and Physicist, 1707-1783.

In this chapter, we start with a few interesting problems that can be modelled by linear or
nonlinear programming. Many other examples can be found in standard textbooks on
optimization (e.g., the recommended books above). We then give a formal definition of linear
programming (LP). We finish this chapter by introducing the graphical method for LPs with
just two variables. It will motivate us to study the Simplex method in the next chapter. General
nonlinear programming will be introduced in later chapters.

1.1 Motivating Examples


1.1.1 Sparse solutions via ℓ1 regularization
Suppose we have a coding system. The m×n matrix A is a code book: each column
represents a code. A message x selects a combination of columns, and the encoder outputs the
signal b ∈ IRm. Suppose the coding mechanism is linear:

Ax = b.

Suppose n ≫ m and the number of columns of A actually used is very small. This means that we would
like to decode a message x with a small number of non-zero entries. Let ‖x‖0 (known as the zero-norm of x)
be the number of non-zero elements of x. We would like to seek a solution x that has small
s = ‖x‖0.
This problem can be solved (under some reasonable conditions) by the ℓ1 minimization:

min ‖x‖1 = |x1| + |x2| + · · · + |xn|, subject to Ax = b. (1.1)

Figure 1.1 illustrates one such instance with

m = 100, n = 1000, s = 20.

That is, the system has 100 equations and 1000 variables, and we would like to find a solution which has
only 20 non-zero elements. Hence, the solution is sparse compared to its dimension n = 1000.
You may wonder why ℓ1 minimization works in this case. Let us look at a simple example in
2 dimensions (Fig. 1.2), where we have only one linear equation. It can be clearly seen that the
optimal solution has one element (out of 2) taking the value 0.
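The ℓ1 problem (1.1) is itself a linear program: writing x = u − v with u, v ≥ 0 and minimising (u1 + v1) + · · · + (un + vn) recovers (1.1), since an optimal (u, v) never has ui and vi both positive. The following is a minimal sketch of this reformulation using SciPy's linprog; the random instance is synthetic and purely illustrative.

import numpy as np
from scipy.optimize import linprog

# Synthetic instance (illustrative only): m equations, n variables, s non-zeros.
rng = np.random.default_rng(0)
m, n, s = 20, 50, 3
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
b = A @ x_true

# Split x = u - v with u, v >= 0, so the objective is sum(u + v).
# Variables are stacked as [u; v]; the equality constraint becomes A u - A v = b.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))

x_hat = res.x[:n] - res.x[n:]
print("non-zeros recovered:", int(np.sum(np.abs(x_hat) > 1e-6)))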

1.1.2 Linear Classification


Example 1.1. (Linear Classification) Suppose we have two data sets S1 and S2 in IR2. The
first set contains m points and the second contains k points. Our aim is to find a line that separates these
two sets as accurately as possible.
Solution: The Separable Case. Suppose that there exists a line (i.e., hyperplane in higher
dimensions):
wT x = γ, w ∈ IR2 and γ ∈ IR,
separating the two sets. That is

wT ai ≥ γ, for ai ∈ S1 ,

Figure 1.1: ℓ1 minimization leads to sparse solutions

and
wT bi ≤ γ, for bi ∈ S2 .
There is a trivial solution (w, γ) = (0, 0) to the above conditions. To guard against a trivial
answer, we seek to enforce the stronger conditions:

wT ai ≥ γ + 1, for ai ∈ S1 , (1.2)

and
wT bi ≤ γ − 1, for bi ∈ S2 . (1.3)

We want to find such a line characterized by (w, γ). This certainly can be reformulated as a
linear programming problem:

min 0
subject to wT ai ≥ γ + 1, for ai ∈ S1
wT bi ≤ γ − 1, for bi ∈ S2 .
Note: There is an implicit assumption here: the existence of a separating line. What shall we do if no
separating line exists?

Solution continued: The Inseparable Case∗. If no line satisfies conditions (1.2) and (1.3), there must
be violations, but we generally do not know which inequalities are violated. Let us
consider the ith inequality in (1.2):
wT ai ≥ γ + 1.
If this inequality is violated, then the violation yi can be calculated by

yi = (γ + 1) − wT ai and yi ≥ 0. (1.4)

Figure 1.2: ℓ1 minimization in 2 dimensions

If this inequality is not violated, the violation must be zero, i.e.,


yi = 0 when the ith inequality is not violated. (1.5)

A unified way to combine (1.4) and (1.5) is

yi = max{(γ + 1) − wT ai , 0}. (1.6)

Similarly, the violation zi for (1.3) can be calculated by

zi = max{wT bi − (γ − 1), 0}. (1.7)

We would like to find a pair (w, γ) that minimizes the total violation:

∑_{i=1}^m yi + ∑_{i=1}^k zi,

where yi and zi are defined in (1.6) and (1.7) respectively.


Therefore, our optimization problem for this case is:
min ∑_{i=1}^m yi + ∑_{i=1}^k zi
subject to yi = max{(γ + 1) − wT ai, 0}, for ai ∈ S1
zi = max{wT bi − (γ − 1), 0}, for bi ∈ S2.

This is equivalent to

min ∑_{i=1}^m yi + ∑_{i=1}^k zi
subject to yi ≥ γ + 1 − wT ai, for ai ∈ S1
zi ≥ wT bi − (γ − 1), for bi ∈ S2
yi ≥ 0, zi ≥ 0.
Note that this formulation covers the strictly separable case: All yi = 0 and zi = 0.
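As a concrete illustration, this LP can be set up and solved numerically. The sketch below uses scipy.optimize.linprog on two synthetic point sets; the variable stacking [w1, w2, γ, y1, . . . , ym, z1, . . . , zk] and all the data are assumptions made for the example only.

import numpy as np
from scipy.optimize import linprog

# Synthetic 2-D point sets (illustrative only).
rng = np.random.default_rng(1)
S1 = rng.standard_normal((30, 2)) + np.array([3.0, 3.0])   # m points
S2 = rng.standard_normal((20, 2)) - np.array([1.0, 1.0])   # k points
m, k = len(S1), len(S2)

# Variable vector: [w1, w2, gamma, y_1..y_m, z_1..z_k]
c = np.concatenate([np.zeros(3), np.ones(m + k)])   # minimise the total violation

# y_i >= gamma + 1 - w^T a_i   <=>   -a_i^T w + gamma - y_i <= -1
A1 = np.hstack([-S1, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
# z_i >= w^T b_i - (gamma - 1)  <=>   b_i^T w - gamma - z_i <= -1
A2 = np.hstack([S2, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
A_ub = np.vstack([A1, A2])
b_ub = -np.ones(m + k)

# w and gamma are free; the violations y, z are nonnegative.
bounds = [(None, None)] * 3 + [(0, None)] * (m + k)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

w, gamma = res.x[:2], res.x[2]
print("w =", w, "gamma =", gamma, "total violation =", res.fun)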

5
Figure 1.3: Strictly separable case

Figure 1.4: Inseparable case

Example 1.2. (Diet Problem) A nutritionist is planning a menu consisting of two main foods
A and B. Each ounce of A contains 2 units of fat, 1 unit of carbohydrate, and 4 units of protein.
Each ounce of B contains 3 units of fat, 3 units of carbohydrates, and 3 units of protein. The
nutritionist wants the meal to provide at least 18 units of fat, at least 12 units of carbohydrate,
and at least 24 units of protein. If an ounce of A costs 20 pence and an ounce of B costs 25
pence, how many ounces of each food should be served to minimize the cost of the meal yet satisfy
the nutritionist's requirement?

Solution: LP Model:

Step 1: Set up Variables. Let x1 and x2 denote the numbers of ounces of foods A and B, respectively, to
be served.

Step 2: Set up Objective Function. The cost of the meal, which is to be minimized, is:

f (x1 , x2 ) := 20x1 + 25x2

Step 3: Set up Constraints.


Constraint 1: The number of units of fat in the meal is no less than 18:

2x1 + 3x2 ≥ 18.

Constraint 2:

x1 + 3x2 ≥ 12 (carbohydrate constraint)

Constraint 3:

4x1 + 3x2 ≥ 24 (protein constraint)

Constraint 4:

x1 ≥ 0, x2 ≥ 0 (nonnegativity constraint)

Step 4: Set up LP problem.

minimize f = 20x1 + 25x2


subject to 2x1 + 3x2 ≥ 18
x1 + 3x2 ≥ 12
4x1 + 3x2 ≥ 24
x1 ≥ 0, x2 ≥ 0.
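A quick numerical check of this model is possible with scipy.optimize.linprog (a sketch; note that linprog minimises and expects ≤ constraints, so each ≥ constraint is multiplied by −1):

from scipy.optimize import linprog

# Cost per ounce of foods A and B (pence).
c = [20, 25]

# Fat, carbohydrate and protein requirements, written as <= after negating.
A_ub = [[-2, -3],    # -(2x1 + 3x2) <= -18
        [-1, -3],    # -(x1 + 3x2)  <= -12
        [-4, -3]]    # -(4x1 + 3x2) <= -24
b_ub = [-18, -12, -24]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
print(res.x, res.fun)   # the solver should report about x = (3, 4) with cost 160 pence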

1.1.3 Sensor network localization


This type of problem often arises in engineering science and it can be described as follows.
Suppose we have a few sensors with known locations (the anchors), ai, i = 1, . . . , m, with ai ∈ IRr and r = 2
or 3. We also have a number of sensors with unknown locations xi ∈ IRr, i = 1, . . . , n. The anchors and
the unknown sensors form a network. Each sensor communicates with its neighbours, which are
other sensors and/or anchors within its radio range. Hence, the neighbours of each sensor are
known. Accordingly we define:

Nix := {xj | xi communicates with xj, j ≠ i, j = 1, . . . , n}


Nia := {aj | xi communicates with aj , j = 1, . . . , m} .

We also suppose that the communication between xi and its neighbours can be converted to
distances, denoted by δij . An example of 50 sensors with 4 anchors is depicted in Figure 1.5.

Figure 1.5: Sensor network localization with partial and noisy observations

The question is to locate the unknown sensors xi such that their distances are as close as
possible to the observed δij. This can be modelled as an optimization problem:

min f(x1, . . . , xn) = ∑_{i=1}^n ∑_{j∈Nix} ( ‖xi − xj‖ − δij )² + ∑_{i=1}^n ∑_{j∈Nia} ( ‖xi − aj‖ − δij )²,

where the norm is the Euclidean norm. The problem is challenging because of the following
computational issues:

(a) There are nr variables (xi ∈ IRr and there are n of them). If n is large, this is a large scale
problem.

(b) The objective function is not differentiable at some points.

(c) It is not convex (convexity is introduced later in this chapter).

(d) It has many local minima.

Figure 1.6 presents a recovery of 50 sensors based on the partially observed {δij}. How to solve this
kind of problem is a topic of this course.

Figure 1.6: Sensor network localization with partial and noisy observations in the unit square [−0.5, 0.5]²: 4
blue points are anchors, the noise level is 10% and the radio range is 0.5. ◦ are the true locations and ∗ represents
the recovered locations. Corresponding pairs are linked by a line.
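A minimal sketch of how this objective can be evaluated and handed to a general-purpose solver is given below; the instance is synthetic (noise-free, for simplicity) and all names are illustrative only. A local method started from a random point may well end in a local minimum, which is exactly the difficulty listed above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
r, n, m = 2, 10, 4                       # dimension, unknown sensors, anchors
anchors = rng.uniform(-0.5, 0.5, (m, r))
true_x = rng.uniform(-0.5, 0.5, (n, r))

radio = 0.5
# Observed distances between sensors within radio range, and to anchors.
pairs_xx = [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(true_x[i] - true_x[j]) <= radio]
pairs_xa = [(i, j) for i in range(n) for j in range(m)
            if np.linalg.norm(true_x[i] - anchors[j]) <= radio]
delta_xx = {p: np.linalg.norm(true_x[p[0]] - true_x[p[1]]) for p in pairs_xx}
delta_xa = {p: np.linalg.norm(true_x[p[0]] - anchors[p[1]]) for p in pairs_xa}

def f(xvec):
    """Sum of squared differences between model and observed distances."""
    x = xvec.reshape(n, r)
    val = sum((np.linalg.norm(x[i] - x[j]) - delta_xx[(i, j)]) ** 2
              for (i, j) in pairs_xx)
    val += sum((np.linalg.norm(x[i] - anchors[j]) - delta_xa[(i, j)]) ** 2
               for (i, j) in pairs_xa)
    return val

res = minimize(f, rng.uniform(-0.5, 0.5, n * r))   # a local method: may hit a local minimum
print("objective at the computed solution:", res.fun)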

1.2 Convex vs Non-Convex Optimization


As indicated in the above examples, optimization problems have an objective to minimize or maximize,
subject to some constraints or to no constraints at all. The objective and constraint
functions can be linear or nonlinear. In this section, we list some of the common optimization
models that we are going to see in this course. For those models, we may ask what properties
would make them computationally tractable. This question leads to the important concept of
convex functions.

1.2.1 Unconstrained Optimization


Suppose there is a function f(·) : IRn → IR. That is, the function is defined over the whole
finite-dimensional space IRn. We would like to find a point x where the function reaches its minimal
value:
min f(x) subject to x ∈ IRn.
Since there is no restriction on the variable x, we call the problem unconstrained.

A well-known unconstrained problem is to minimize the Rosenbrock function (1960) of two variables:

min f(x, y) = (1 − x)² + 100(y − x²)².

The global minimizer is (x, y) = (1, 1). Suppose we do not know it. It is not obvious what methods
would quickly lead to the optimal solution. The function is plotted below:

The sensor network localization is another example of unconstrained optimization and it is one
of the most challenging problems to solve when n is big.
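As a small illustration, a general-purpose local method finds the Rosenbrock minimizer quickly from a standard starting point (a sketch using SciPy; the starting point (−1.2, 1) is a conventional choice, not prescribed by these notes):

import numpy as np
from scipy.optimize import minimize

def rosenbrock(v):
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

res = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="BFGS")
print(res.x)   # should be close to the global minimizer (1, 1)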

1.2.2 Constrained Optimization


There are often applications where the variables x have to satisfy certain conditions. For example,
the ℓ1 minimization problem (1.1) only considers those points x that satisfy the condition
Ax = b. We call such a problem constrained.

In general, we have functions:

f(x): IRn → IR, as the objective function;

hi(x): IRn → IR, i = 1, . . . , m0, as the equality constraints;

gj(x): IRn → IR, j = 1, . . . , m1, as the inequality constraints.

They define the constrained optimization problem by

min f(x)
s.t. hi(x) = 0, i = 1, . . . , m0
gj(x) ≤ 0, j = 1, . . . , m1 (1.8)
x ∈ C,

where C is a subset of IRn. For example, in the diet problem, we can choose

C = { (x1, x2) | x1 ≥ 0, x2 ≥ 0 }.

It is not hard to formulate many practical problems in the form of (1.8). One key issue,
however, is how you would solve it and what makes the problem easier to solve. There is a
common consensus that if a problem is convex, there should be some efficient algorithms to solve
it.

1.2.3 Convex Optimization


We define convex sets and convex functions.

Definition 1.3. A set C ⊆ IRn is a convex set if it contains the entire line segment between
every pair of points in C. That is,
λx + (1 − λ)y ∈ C, whenever x, y ∈ C and 0 ≤ λ ≤ 1.
There are many convex sets.

• The square box is convex:

{ (x1, x2) | 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1 }.

• The solutions of a linear equation form a convex set:

C = { x ∈ IRn | Ax = b }.

In particular, a set is called polyhedral if it can be put in the following form:

C = { x ∈ IRn | Ax ≤ b }.

• Any section shown in Fig. 1.7 is convex.

Definition 1.4. (Extreme point) For a convex set C ⊆ IRn, a point x̂ ∈ C is called an extreme
point of C if it is not in the interior of any line segment wholly contained in C.
Definition 1.5. A function f : IRn → IR is convex if for any x, y ∈ IRn, it holds that

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y), for all 0 ≤ λ ≤ 1.

There are many convex functions.

• Linear functions are convex:

f(x) = cT x,

where c is a vector in IRn.

• f(x) = x² is convex.

• The function

f(X) = − log det(X), where X is a positive definite matrix,

is convex.
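The inequality in Definition 1.5 can be spot-checked numerically at randomly sampled points. The sketch below does this for f(x) = x² in one dimension; of course, sampling can only refute convexity, never prove it.

import numpy as np

def is_convex_on_samples(f, num_trials=10_000, scale=10.0, seed=0):
    """Spot-check f(lx + (1-l)y) <= l f(x) + (1-l) f(y) at random points."""
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        x, y = rng.uniform(-scale, scale, 2)
        lam = rng.uniform(0.0, 1.0)
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + 1e-12:
            return False
    return True

print(is_convex_on_samples(lambda x: x ** 2))      # True: no violation found
print(is_convex_on_samples(np.sin))                # False: sin is not convex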

Figure 1.7: Conic Sections

Definition 1.6. The problem (1.8) is called a convex optimization problem if

f(x) : IRn → IR is convex,
hi(x) : IRn → IR, i = 1, . . . , m0, are linear,
gj(x) : IRn → IR, j = 1, . . . , m1, are convex, and
C is convex.

The problem (1.8) is called a linear programming problem if all the functions f, hi, gj are linear
and C is polyhedral.

We also note that minimizing a function f(x) is equivalent to maximizing (−f(x)):

min f (x) ⇐⇒ max −f (x).

So in optimization, it does not matter if we consider a maximization or minimization problem.


Which form to consider really depends on the convenience in developing theory and algorithms.

1.3 Linear Programming
1.3.1 The standard form
The standard form of linear programming takes the following form:
maximize z = c1x1 + c2x2 + · · · + cnxn
subject to a11x1 + a12x2 + · · · + a1nxn ≤ b1
a21x1 + a22x2 + · · · + a2nxn ≤ b2
... (1.9)
am1x1 + am2x2 + · · · + amnxn ≤ bm
and
x1 ≥ 0, x2 ≥ 0, . . . , xn ≥ 0.
Common terminologies used in linear programming are as follows:

(a) Objective function. The function being maximized, c1x1 + · · · + cnxn, is called the objective function.

(b) Constraints. The linear inequalities in the restrictions are referred to as constraints. For
example, the first constraint is a11x1 + · · · + a1nxn ≤ b1. There are m constraints in total in
(1.9).

(c) Nonnegativity constraints. The xj ≥ 0 are often called nonnegativity constraints.


(d) Parameters. The input constants (the aij, bi, cj) are often referred to as parameters.
They are often called coefficients in the LP. For example, c1 is the coefficient of variable
x1 in the objective function.

(e) x1, . . . , xn are called the variables of the LP; n is the number of variables and m is the number
of constraints, not including the nonnegativity constraints.

Standard form in Matrix-Vector format.


It is convenient to write linear programming problems in matrix notation. Let

       
A = (aij) ∈ IRm×n (the coefficient matrix), x = (x1, . . . , xn)T ∈ IRn, b = (b1, . . . , bm)T ∈ IRm, c = (c1, . . . , cn)T ∈ IRn,

we can write the standard linear programming problem as

maximize z = cT x
subject to Ax ≤ b
x ≥ 0.

1.3.2 Other forms
Linear programming problems may appear in different forms other than the standard form. But
no matter what form an LP is formulated in, it can always be converted into the standard
form. Common variations are the following.

(a) Minimizing rather than maximizing the objective function. But we have the following
equivalence:

min cT x ⇐⇒ max −cT x.

(b) Some constraints with a greater-than-or-equal-to inequality:

ai1x1 + ai2x2 + · · · + ainxn ≥ bi, for some i.

This can be converted to a less-than-or-equal-to inequality by multiplying both sides of the
inequality by −1:

−ai1x1 − ai2x2 − · · · − ainxn ≤ −bi, for some i.

(c) Some constraints are in equation form:

ai1x1 + ai2x2 + · · · + ainxn = bi, for some i.

Such an equation is equivalent to the pair of inequalities

ai1x1 + ai2x2 + · · · + ainxn ≤ bi and −ai1x1 − ai2x2 − · · · − ainxn ≤ −bi.

(d) The nonnegativity constraints for some decision variables are absent.

xi unrestricted in sign ⇐⇒ xi := xi1 − xi2 , xi1 ≥ 0, xi2 ≥ 0.

For example, we have

min z = 2x1 − 3x2                         max z′ = −2x1 + 3(x3 − x4)
s.t. x1 + 2x2 ≥ 10            ⇐⇒          s.t. −x1 − 2(x3 − x4) ≤ −10
x1 ≥ 0, x2 is free                        x1 ≥ 0, x3 ≥ 0, x4 ≥ 0.
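These conversion rules are mechanical enough to code. The sketch below applies them to the example above; the function name and the data layout are my own choices, not part of the notes.

import numpy as np

def to_standard_form(c_min, A_ge, b_ge, free_vars):
    """Convert  min c^T x, A x >= b, some x_i free  into the standard
    max / <= / nonnegative form by negating the objective, flipping the
    inequalities, and splitting each free variable into a difference of
    two nonnegative variables."""
    n = len(c_min)
    extra = np.zeros((n, len(free_vars)))
    for col, i in enumerate(free_vars):
        extra[i, col] = -1.0
    T = np.hstack([np.eye(n), extra])      # maps the new variables to the original x
    c_max = -(np.asarray(c_min) @ T)       # min c^T x  <=>  max (-c)^T (new vars)
    A_le = -(np.asarray(A_ge) @ T)         # A x >= b   <=>  (-A) (new vars) <= -b
    b_le = -np.asarray(b_ge)
    return c_max, A_le, b_le

# Example from the notes: min 2x1 - 3x2, x1 + 2x2 >= 10, x1 >= 0, x2 free.
c_max, A_le, b_le = to_standard_form([2, -3], [[1, 2]], [10], free_vars=[1])
print(c_max)          # [-2.  3. -3.]  i.e. max -2x1 + 3x3 - 3x4  (with x2 = x3 - x4)
print(A_le, b_le)     # [[-1. -2.  2.]] [-10.]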

1.4 Graphical Method

1.4.1 Terminology for solutions in LP


Standard terminologies for solutions in LP are as follows:

(a) A feasible solution is a solution for which all the constraints are satised.
(b) The feasible region is the set of all feasible solutions.
(c) An optimal solution is a feasible solution that maximizes the objective function in the
standard LP.

Look at the following figures to see what feasible regions may look like in LP.

Figure 1.8: Feasible region (shaded) for the constraints: x1 ≥ 0, x2 ≥ 0, x1 ≤ 4 and x2 ≤ 4.

Figure 1.9: Feasible region (shaded) for the constraints: x1 ≥ 0, x2 ≥ 0, 4x1 + 3x2 ≤ 12 and 2x1 + 5x2 ≤ 10.

Figure 1.10: Graphical method: optimal solution x1 = 15/7, x2 = 8/7 with the optimal objective
function value z = 300/7.

1.4.2 Graphical Method


Now we introduce the graphical method. Let us consider the following example:

max z = 12x1 + 15x2


s.t. 4x1 + 3x2 ≤ 12
2x1 + 5x2 ≤ 10
x1 ≥ 0, x2 ≥ 0.

Steps of the graphical method for LPs with two variables:

S.1 Draw each constraint on a graph to decide the feasible region.

S.2 Draw the objective function on the graph.

S.3 Decide which corner point yields the largest (the smallest for a minimization problem) objective
function value. That corner point is an optimal solution.
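As a cross-check of the example above, the same LP can be handed to a solver (a sketch with scipy.optimize.linprog; since linprog minimises, we maximise z by minimising −z):

from scipy.optimize import linprog

# max 12x1 + 15x2  <=>  min -(12x1 + 15x2)
res = linprog(c=[-12, -15],
              A_ub=[[4, 3], [2, 5]],
              b_ub=[12, 10],
              bounds=(0, None))
print(res.x, -res.fun)   # expect x = (15/7, 8/7) and z = 300/7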

An important theoretical result that comes out of the graphical method is the following.

Theorem 1.7. For a linear programming problem, at least one corner point (also an extreme
point) is an optimal solution, provided that the LP has an optimal solution.

1.5 Fundamental Theorem of Linear Programming


1.5.1 Basic Feasible Solution
We consider the following system:

Ax = b, x ≥ 0, (1.10)

where A is an m×n matrix with m ≤ n.

Rank assumption: We assume the rank of A is m. That is, A has full row rank.

Suppose that B is an m×m sub-matrix of A and the rank of B is m. That is, B has full rank
and hence is nonsingular. We consider the equation

BxB = b,

whose solution is

xB = B⁻¹b.
Now we return to our original equation in (1.10). We split the matrix A in two parts:

A = [B, N ],

where N is an m × (n − m) matrix. The equation (1.10) is equivalent to

BxB + N xN = b.

In particular, we set xN = 0 (the zero vector of size (n − m)). Then

x = [xB , 0]

is a solution of Ax = b. We have the following definition.

Definition 1.8. Consider the system (1.10) under the rank assumption that the rank of A is m.
Let B be an m × m submatrix of A with full rank, and let

xB = B⁻¹b.

(i) The solution

x = [xB, 0]

is called a Basic Solution of (1.10).

(ii) If xB ≥ 0 (each of its elements is nonnegative), then

x = [xB, 0]

is called a Basic Feasible Solution (BFS) of (1.10).

Example 1.9. Consider a system of (1.10) with

A = [ −1  2  2 ]        b = [ 3 ]
    [  0  1  0 ],           [ 1 ].

The rank of A is 2.

(i) Basic solution: Let

B = [ −1  2 ]
    [  0  1 ].

Then

xB = B⁻¹b = [ −1  2 ] [ 3 ] = [ −1 ]
            [  0  1 ] [ 1 ]   [  1 ].

So the solution

x = (x1, x2, x3) = (−1, 1, 0)

is a Basic Solution, but not a Basic Feasible Solution (because x1 < 0).

(ii) Basic Feasible Solution: Let

B = [ 2  2 ]
    [ 1  0 ].

Then

xB = B⁻¹b = −(1/2) [  0  −2 ] [ 3 ] = [  1  ]
                   [ −1   2 ] [ 1 ]   [ 1/2 ].

So the solution

x = (x1, x2, x3) = (0, 1, 1/2)

is a Basic Feasible Solution.
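Example 1.9 is easy to verify numerically (a small NumPy sketch; column indices are 0-based in the code):

import numpy as np

A = np.array([[-1.0, 2.0, 2.0],
              [ 0.0, 1.0, 0.0]])
b = np.array([3.0, 1.0])

# (i) Columns 1 and 2 of A (0-based: 0 and 1) give a basic solution.
B = A[:, [0, 1]]
x = np.zeros(3)
x[[0, 1]] = np.linalg.solve(B, b)
print(x)               # [-1.  1.  0.]  -> basic, but not feasible (x1 < 0)

# (ii) Columns 2 and 3 of A (0-based: 1 and 2) give a basic feasible solution.
B = A[:, [1, 2]]
x = np.zeros(3)
x[[1, 2]] = np.linalg.solve(B, b)
print(x)               # [0.   1.   0.5] -> basic feasible solution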

The BFS is very important to our understanding of linear programming. We discuss it below using a
minimal amount of mathematics (linear algebra).

1.5.2 Fundamental Theorem of Linear Programming


We consider the following linear programming problem:

max cT x
s.t. Ax = b, x ≥ 0, (1.11)

where A is an m × n matrix. We further assume that the matrix A satisfies the rank assumption
(it has rank m).

Theorem 1.10. (Fundamental Theorem of Linear Programming) Consider the linear program
(1.11) with A satisfying the rank assumption. The following hold.

(i) If there is a feasible solution, there must be a basic feasible solution.

(ii) If there is an optimal feasible solution, there must be an optimal basic feasible solution.

Proof. (i) Let ai be the ith column of A, i.e.,

A = [a1 , a2 , . . . , an ].

Suppose xT = (x1, x2, . . . , xn) is a feasible point. We then have

x1 a1 + x2 a2 + · · · + xn an = b. (1.12)

Suppose that there are exactly k components of x that are non-zero. Without loss of generality, we assume
that the first k are non-zero:
x1 > 0, x2 > 0, · · · , xk > 0, xk+1 = · · · = xn = 0.
Then the equation (1.12) becomes

x1 a1 + x2 a2 + · · · + xk ak = b.

We consider two cases, depending on whether a1, . . . , ak are linearly independent.

Case 1. {a1, a2, · · · , ak} are linearly independent. Obviously k ≤ m. If k = m, then x is already
a basic feasible solution and we are done. If k < m, we can always pick (m − k) columns
from ak+1, . . . , an to make up a matrix (without loss of generality, we name those columns
ak+1, . . . , am)

B = [a1, . . . , ak, ak+1, . . . , am]

such that B has rank m. Let

xB = (x1, x2, . . . , xk, 0, . . . , 0)T, with (m − k) trailing zeros.

Then

BxB = b.

The fact that xB ≥ 0 means that x is a basic feasible solution.

Case 2. {a1, a2, · · · , ak} are linearly dependent. Then there exist y1, y2, . . . , yk (not all of them
zero) such that

y1a1 + y2a2 + · · · + ykak = 0.

Let

y = (y1, y2, · · · , yk, 0, . . . , 0)T, with (n − k) trailing zeros,

and let

z = x − εy,

where ε is a number. The key observation is that

Az = b, for any ε.

We choose (note: at least one yi is not zero; without loss of generality we may assume that some
yi > 0, replacing y by −y if necessary)

ε = min{xi/yi : yi > 0}.

Then z has at least one more zero component than x, and z is still a feasible solution (z ≥ 0 by the
choice of ε). If the corresponding columns (those of the non-zero components of z) are linearly
independent, then z is a basic feasible solution. If the corresponding columns are not linearly
independent, we can repeat the above process to get another feasible solution. We continue this
process until we reach a set of linearly independent columns, which must happen because of the
rank assumption. We finally find a basic feasible solution.

We omit the proof of (ii) as it is similar to the first part. □

Remark: This theorem reduces the task of solving the linear program (1.11) to that of searching
over all basic feasible solutions. There are at most

n!/(m!(n − m)!)

basic solutions (a finite number of BFS). However, when n and m get large, this number
grows too fast and hence the search would be slow. We must execute a clever search, which leads
to the Simplex Method in the next chapter.
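To make the remark concrete, here is a brute-force enumeration of all basic (feasible) solutions for the small system of Example 1.9. This is exactly the kind of exhaustive search that becomes hopeless as n and m grow, which is why the Simplex Method is needed.

import itertools
import numpy as np

A = np.array([[-1.0, 2.0, 2.0],
              [ 0.0, 1.0, 0.0]])
b = np.array([3.0, 1.0])
m, n = A.shape

# Try every choice of m columns: at most n!/(m!(n-m)!) candidate basic solutions.
for cols in itertools.combinations(range(n), m):
    B = A[:, cols]
    if abs(np.linalg.det(B)) < 1e-12:      # singular: these columns give no basic solution
        continue
    x = np.zeros(n)
    x[list(cols)] = np.linalg.solve(B, b)
    feasible = np.all(x >= -1e-12)
    print(cols, x, "BFS" if feasible else "basic but infeasible")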

We state our final result without giving a proof. The result relates BFSs to the extreme points of the
feasible set defined by the system (1.10). Let

K = {x ∈ IRn | Ax = b, x ≥ 0} .

Theorem 1.11. (Equivalence of Extreme Points and Basic Solutions) Consider the system (1.10)
with A satisfying the rank assumption. A vector x is an extreme point of K if and only if x is a
BFS to (1.10).
