Wolfe conditions for deciding step length in inexact line search: An example with the Rosenbrock function

In inexact line search (a numerical optimisation technique), the step length (or learning rate) must be decided. In that connection, the Wolfe conditions are central. Here I will give an example showing why they are useful. More on this topic can be found elsewhere, e.g. in the book by Nocedal and Wright (2006), “Numerical Optimization”, Springer.

Wolfe conditions

The Wolfe conditions consist of the sufficient decrease condition (SDC) and the curvature condition (CC):

$$
\begin{aligned}
\text{(SDC)}\quad & f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\
\text{(CC)}\quad & \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k,
\end{aligned}
$$
with $0 < c_1 < c_2 < 1$.

The strong Wolfe conditions are:

$$
\begin{aligned}
\text{(SDC)}\quad & f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \\
\text{(CC')}\quad & \left| \nabla f(x_k + \alpha_k p_k)^T p_k \right| \le c_2 \left| \nabla f_k^T p_k \right|,
\end{aligned}
$$
with $0 < c_1 < c_2 < 1$.

Example

We use the Rosenbrock function to illustrate these conditions when choosing step length. We use a classic version, namely $$f(x, y) = (1 - x)^2 + 100 (y - x^2)^2.$$ Its global minimum is at $(1, 1)$ and the gradient is $$\nabla f(x, y) = \left( -2(1 - x) - 400 x (y - x^2),\; 200 (y - x^2) \right).$$
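To make the example concrete, here is a minimal sketch of the function and its gradient in Python/NumPy (the names `rosenbrock` and `rosenbrock_grad` are just illustrative, not from any particular library):

```python
import numpy as np

def rosenbrock(z):
    """Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    x, y = z
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(z):
    """Gradient of the Rosenbrock function."""
    x, y = z
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

# The global minimum is at (1, 1), where the gradient vanishes:
print(rosenbrock(np.array([1.0, 1.0])))       # 0.0
print(rosenbrock_grad(np.array([1.0, 1.0])))  # [-0.  0.] (the zero vector)
```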

A few illustrations of the function f (the global minimum marked with a red cross):

Let $z = (x, y)$ and $z_k = (x_k, y_k)$, and further $f(z) = f(x, y)$ for easier notation. The line search is then $z_{k+1} = z_k + \alpha_k p_k$ for direction $p_k$ and step length $\alpha_k$. In determining the step length the function $\Phi(\alpha) = f(z_k + \alpha p_k)$ is often used. It tells us the value of the function we are minimising for any given step length $\alpha$. Recall that the current position is $z_k$ and the direction $p_k$ is chosen. The only thing left to choose is $\alpha$. So to make that more clear, we refer to $\Phi(\alpha)$ instead of $f(z_k + \alpha p_k)$.
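As a small sketch, $\Phi$ can be built directly from $f$, $z_k$ and $p_k$ (the helper name `make_phi` is just illustrative):

```python
def make_phi(f, z_k, p_k):
    """Return the univariate function phi(alpha) = f(z_k + alpha * p_k)."""
    def phi(alpha):
        return f(z_k + alpha * p_k)
    return phi
```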

The aim is that $f(z_k + \alpha p_k) < f(z_k)$, and if the direction chosen is a descent direction (e.g. the negative gradient), then for sufficiently small $\alpha$ this will be possible. But just using $f(z_k + \alpha p_k) < f(z_k)$ as a criterion is not sufficient to ensure convergence, which is why the Wolfe conditions are used.

Now, say we start the line search at $z_0 = (x_0, y_0) = (-2.2, 3)$; the gradient at that point is then $\nabla f_0 = \nabla f(z) \big|_{z = z_0} = (-1625.6, -368)$.

Say we do a line search from $z_0 = (x_0, y_0)$ in the direction of the negative gradient, given by $p_0 = -\nabla f_0 = (1625.6, 368)$.

At $z_0 = (x_0, y_0)$ we have that $\Phi(\alpha) = f(z_0 + \alpha p_0)$.

Note that for $\alpha = 1$ we get $z_0 + \alpha p_0 = ((-2.2) + (1625.6),\; (3) + (368)) = (1623.4, 371)$, which is far too big a step, so instead smaller values of $\alpha$ are tried.
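A quick numeric check of these quantities, assuming the `rosenbrock` and `rosenbrock_grad` sketches from above:

```python
import numpy as np

z0 = np.array([-2.2, 3.0])
grad0 = rosenbrock_grad(z0)   # approx. [-1625.6, -368.0]
p0 = -grad0                   # steepest descent direction, approx. [1625.6, 368.0]

print(rosenbrock(z0))         # approx. 348.8
print(z0 + 1.0 * p0)          # approx. [1623.4, 371.0] -- far too long a step
```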

The path $z_0 + \alpha p_0$ for $\alpha$ between $0$ and $0.003$ (chosen to make this example illustrative) is then:

This can instead be shown by visualising $\Phi(\alpha)$, as this is a univariate function, whereas $f$ can be difficult or impossible to visualise directly (unlike in this two-dimensional example). For this example, $\Phi(\alpha)$ looks like this:

As seen, if $\alpha$ is sufficiently small, then we move to a position with a smaller value of $f$. But we can also end up in a place with a higher value (even though we are moving in a descent direction).

The sufficient decrease condition

We now consider the Wolfe conditions. The SDC (sufficient decrease condition) is $$\text{(SDC)}\quad \Phi(\alpha) = f(z_k + \alpha p_k) \le f(z_k) + c_1 \alpha \nabla f_k^T p_k$$ with $0 < c_1 < 1$. Let us dissect this condition.

First, we focus on $\nabla f_k^T p_k$. Note that $\nabla f_k^T p_k = p_k^T \nabla f_k = \lVert p_k \rVert \, \lVert \nabla f_k \rVert \cos(\theta)$ (the latter equality in a Euclidean setting). What do we know about $\theta$? We require that $p_k$ is a descent direction (“pointing roughly the same way as the negative gradient, $-\nabla f_k$”), and thus the angle between $p_k$ and the negative gradient $-\nabla f_k$ is in $(-\pi/2, \pi/2)$.

That angle is measured against the negative gradient, whereas the expression above uses the gradient itself. Since the angle between $p_k$ and $-\nabla f_k$ lies in $(-\pi/2, \pi/2)$, we have $(-\nabla f_k)^T p_k > 0$, and multiplying with $-1$ means that $\nabla f_k^T p_k < 0$.

Further, $f(z_k)$ is a value in $\mathbb{R}$. And as $c_1 > 0$, $\alpha > 0$ and $\nabla f_k^T p_k < 0$, we have that $c_1 \nabla f_k^T p_k < 0$. In other words, the condition can be written as $\Phi(\alpha) \le \beta_0 + \alpha \beta_1$ where $\beta_0 = f(z_k)$ and $\beta_1 = c_1 \nabla f_k^T p_k < 0$. So a straight line with negative slope.
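As a sketch, the SDC can then be checked directly from this inequality (the helper name `sufficient_decrease` is just illustrative):

```python
def sufficient_decrease(f, grad, z_k, p_k, alpha, c1):
    """Sufficient decrease condition (SDC):
    f(z_k + alpha * p_k) <= f(z_k) + c1 * alpha * grad(z_k)^T p_k."""
    lhs = f(z_k + alpha * p_k)
    rhs = f(z_k) + c1 * alpha * grad(z_k) @ p_k
    return lhs <= rhs
```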

In this case, $\beta_0 = f(z_0) = 348.8$ and $\beta_1 = c_1 \nabla f_0^T p_0$ (using $p_0 = -\nabla f_0$). See below for some choices of $c_1$:

In other words, varying $c_1$ from $0$ to $1$ gives straight bounding lines ranging from the horizontal line $f(z_k)$ for $c_1 = 0$ to the line $f(z_k) + \alpha \nabla f_k^T p_k$ for $c_1 = 1$, where the latter corresponds to the tangent of $\Phi(\alpha)$ at $\alpha = 0$.

To see this, note that the slope of the tangent to $\Phi(\alpha)$ at $\alpha = 0$ is $\Phi'(0) = \nabla f_k^T p_k$, the directional derivative of the objective function $f$ in the direction $p_k$. The bounding line has slope $c_1 \nabla f_k^T p_k$, which coincides with this exactly for $c_1 = 1$.
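This can be verified numerically with a finite-difference approximation of $\Phi'(0)$ (a sketch, reusing the `rosenbrock` functions assumed above):

```python
import numpy as np

z0 = np.array([-2.2, 3.0])
p0 = -rosenbrock_grad(z0)

h = 1e-8
phi_slope_at_0 = (rosenbrock(z0 + h * p0) - rosenbrock(z0)) / h
directional_derivative = rosenbrock_grad(z0) @ p0

print(phi_slope_at_0)          # both approximately -2.778e6
print(directional_derivative)
```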

So the factor $c_1$ makes it possible to go between the extreme cases: from a horizontal line to the tangent line given by the directional derivative of the objective function $f$ in the direction $p_k$.

The curvature condition

The curvature condition (CC) is $$\text{(CC)}\quad \nabla f(z_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k.$$

We are at $z_k$ and looking in direction $p_k$, which is a descent direction. We know from before that $\nabla f_k^T p_k < 0$, so it goes downhill from where we are now. We are looking for a critical point $z^*$ such that $\nabla f(z^*) = 0$. So at the next point $z_{k+1} = z_k + \alpha_k p_k$ we think the directional derivative, still looking in direction $p_k$, should be less negative than it is now.

So the directional derivative at the next iterate, continuing in the same direction that got us here, has to be greater than or equal to $c_2 \nabla f_k^T p_k$, with $c_2$ controlling this bound so that it ranges from $0$ (as $c_2 \to 0$) to $\nabla f_k^T p_k$ (for $c_2 = 1$), i.e. the directional derivative where we are standing now ($z_k$).
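A corresponding sketch of the check (the helper name `curvature_condition` is just illustrative):

```python
def curvature_condition(grad, z_k, p_k, alpha, c2):
    """Curvature condition (CC):
    grad(z_k + alpha * p_k)^T p_k >= c2 * grad(z_k)^T p_k."""
    return grad(z_k + alpha * p_k) @ p_k >= c2 * grad(z_k) @ p_k
```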

Combining

Result for $c_1 = 0.025$ and $c_2 = 0.2$:
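The accepted region shown in the figure can also be found numerically by scanning a grid of step lengths; a minimal sketch, assuming the helpers sketched above:

```python
import numpy as np

z0 = np.array([-2.2, 3.0])
p0 = -rosenbrock_grad(z0)
c1, c2 = 0.025, 0.2

alphas = np.linspace(0.0, 0.003, 301)
accepted = [a for a in alphas
            if sufficient_decrease(rosenbrock, rosenbrock_grad, z0, p0, a, c1)
            and curvature_condition(rosenbrock_grad, z0, p0, a, c2)]

print(len(accepted))                     # number of grid points satisfying both conditions
if accepted:
    print(min(accepted), max(accepted))  # rough range of acceptable step lengths
```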

The strong Wolfe conditions

The curvature condition in the strong Wolfe conditions is:

$$\text{(CC')}\quad \left| \nabla f(z_k + \alpha_k p_k)^T p_k \right| \le c_2 \left| \nabla f_k^T p_k \right|.$$

This is very similar to CC, except that now the directional derivative at the next iterate, continuing in the same direction that got us here, also cannot be too positive.
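A sketch of this check (again, the helper name is just illustrative):

```python
def strong_curvature_condition(grad, z_k, p_k, alpha, c2):
    """Strong curvature condition (CC'):
    |grad(z_k + alpha * p_k)^T p_k| <= c2 * |grad(z_k)^T p_k|."""
    return abs(grad(z_k + alpha * p_k) @ p_k) <= c2 * abs(grad(z_k) @ p_k)
```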

Again, combining with $c_1 = 0.025$ and $c_2 = 0.2$ we obtain:

Summary

Small values of $c_1$ (close to $0$) mean that we can go far (limited almost only by a horizontal line). Large values of $c_2$ (close to $1$) mean that the directional derivative at the next iterate can be almost as negative as it currently is.
