
Review of "Distributed algorithms for basis pursuit"

Final report prepared for CPSC 542F (2009W)
Tim Lin, December 18, 2009

Introduction

The goal of this report is to provide a critical analysis of a distributed method for solving basis pursuit based on decomposition, proposed by João Mota first in his master's thesis [Mot08] and again later in a conference paper [Mot09]. The scope of his work actually covers three entirely different algorithms for solving basis pursuit in a distributed fashion, each corresponding to a different nodal distribution scheme and underlying optimization methodology in a more-or-less one-to-one fashion. However, the intention of this report is to examine only one of these, namely the subgradient-based decomposition method introduced in section 2.1 of the paper. To clarify, the method of concern here is referred to as Algorithm 1 in the paper and Algorithm 2 in the thesis, and will be alluded to in this report as Mota's algorithm. Unless otherwise noted, all references to Mota's work are to the conference paper. Furthermore, the basis pursuit (BP) problem discussed here takes the general form

    minimize_{x ∈ R^n} ||x||_1   subject to   Ax = y,        (1)

where, due to its connection with compressive sensing [CRT06], A is usually an m × n matrix representing an underdetermined linear system with m ≪ n.
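As the section on related works will recall, (1) can be recast as a linear program via the standard split x = u − v with u, v ≥ 0. The following is a minimal sketch of that reformulation with an off-the-shelf LP solver; the problem sizes and the HiGHS backend are my own choices for illustration, not anything taken from Mota's paper.

```python
# Basis pursuit (1) as an LP: min 1^T(u+v)  s.t.  A(u-v) = y,  u, v >= 0,
# where x = u - v recovers the BP solution.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 50, 250, 10                       # an underdetermined system, m << n
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                              # observations of a k-sparse signal

c = np.ones(2 * n)                          # sum(u) + sum(v) equals ||x||_1
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=(0, None), method="highs")
x_bp = res.x[:n] - res.x[n:]
print("recovered support size:", int(np.sum(np.abs(x_bp) > 1e-8)))
```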

The rest of this report is organized as follows: first, a brief overview of why only one algorithm out of the three is considered, followed by some historical background to give context to the ideas behind this algorithm. Finally, a critical study of Mota's paper and thesis is carried out in the sections where the algorithm is introduced. This report seeks to comment on the sufficiency of its motivation, the presentation of the algorithm, and the thoroughness with which his claims are supported.
Justification for limited scope

Since this report virtually ignores two-thirds of the paper, I feel that some explanation is required. Basis pursuit is, partly thanks to compressive sensing, rapidly becoming a popular problem with a veritable myriad of practical applications. In turn, Mota's paper covers a fairly comprehensive range of schemes for decomposing these problems, going so far as to introduce three entirely different approaches. In any other context, each individual approach would warrant a short paper of its own. Since Mota formulated these approaches during his studies towards a master's degree in engineering, my guess is that such a wide scope is borne mostly out of an aggressive program to develop a comprehensive toolset for a large number of conceivable applications. For example, the scheme introduced in section 3 naturally lends itself to a distributed sensor network producing independent measurements of the same signal. The defining feature here is an A that represents the aggregate of the physical measurement processes in the sensors, and the resulting decomposition for this approach is therefore a vertical partitioning of A into groups of rows. Compare and contrast this with the schemes presented in section 2, where a horizontal partitioning of A along the columns is considered. The system represented here is wholly different from the vertical partitioning case, with the algorithmic distinction that the decomposition is taken over the unknown x itself, and not over a set of known observations y. A horizontal partitioning of A, while not physically significant, is nevertheless of central importance for formulating a distribution scheme. The most typical situations lead to an A with far more columns than rows, so y often fits into memory while x cannot. In this case, breaking x into pieces requires a corresponding partitioning of A; a small numerical illustration follows. This is the case that I am interested in (due to personal applications), and it will be the scope of this report.
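To make the column-wise (horizontal) partitioning concrete, here is a tiny numpy sketch, with illustrative sizes of my own choosing: once x is split into blocks, Ax reassembles from the per-block products A_p x_p, so each node only ever needs its own slice of A.

```python
# With column blocks A_p and matching blocks x_p, Ax = sum_p A_p x_p.
import numpy as np

rng = np.random.default_rng(1)
m, n, P = 50, 250, 10                      # 10 equally sized column blocks
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

blocks = np.array_split(np.arange(n), P)   # column indices owned by each node
partial = sum(A[:, idx] @ x[idx] for idx in blocks)
assert np.allclose(partial, A @ x)         # per-block products reassemble Ax
```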

Background

The formal concept of separating a given optimization problem into independent subproblems can first be attributed to Philip Wolfe in 1959, borne out of a concrete

organization of tricks that he had independently observed from Ray Fulkerson and George Dantzig. At the time, both had worked on linear programming problems with constraint matrices that were too big to handle on a single computer, due to an overwhelming (by their standards) amount of data. Wolfe would often hear about these problems from Dantzig when they worked near each other at the RAND Corporation, and that is when he saw that both Fulkerson and Dantzig had developed specialized ad-hoc methods to overcome the size problem. After realizing that the two were converging on the same approach, Wolfe worked with Dantzig to formalize it into a more complete, general decomposition method for LP [Lus, DW60]. The result of this joint venture is the venerable Dantzig-Wolfe decomposition. The method can roughly be seen as an extension of the simplex method to LPs whose constraint matrices have an almost block-diagonal structure. Examples are staircase matrices (slightly overlapping diagonal blocks) and block-angular matrices (strictly block-diagonal with a few extra dense rows or columns). The gist is that a particular partitioning can be chosen over the unknown x such that the total constraint set splits into several local subsets (the block-diagonal part of the matrix) and a remaining, hopefully much smaller, global subset. This partitioning results in several almost separable LP subproblems along the natural division of block-local constraints. The premise of the approach is that local constraints can be satisfied using only knowledge of the local block submatrices, while the global constraint is satisfied by a master program that reconciles the local sub-solutions. This master-slave structure arises from a dual reformulation of the decomposed subproblems, turning the global constraint on each subproblem's variables into a term of its Lagrangian function, so that a local solution is well-defined. Using the Minkowski mapping [Teb01] one can show that a global Lagrange multiplier exists for the subproblems' global constraint term, and that the corresponding dual solution maps to the optimal solution of the original LP problem. The Lagrange multiplier search is a convex problem, which in Dantzig-Wolfe and its derivatives is usually solved with a subgradient-based algorithm. Since the total constraint matrix never has to be known in its entirety at any given point, this decomposition method quickly became popular as a way to handle problems that would otherwise be too computationally overwhelming to consider. In the late 1970s, exponential improvement in computational hardware eventually caught up with the needs of numerical optimizers, and many of the once intractable problems that required the Dantzig-Wolfe decomposition became quite manageable for the new crop of powerful machines then becoming available. By this time the

status of Dantzig-Wolfe as an enabler began to fade, and it became yet another variant of the simplex method, one with the merit of exploiting the aforementioned special structure in the constraint matrix. Meanwhile, over a decade of active service had earned Dantzig-Wolfe a permanent place in numerical optimization textbooks, appearing in popular works such as [Ber99b] and [NW99].
Dual decomposition

Mota's algorithm relies heavily on the generalized form of Dantzig-Wolfe, typically known in the literature as dual decomposition via Lagrangian relaxation. While Dantzig-Wolfe dealt with strictly linear objectives and constraints, the approach applies directly to any separable convex problem as long as the global constraint remains linear. In [Ber99a] this formulation appears directly, in fact without any local constraints whatsoever (one would assume they are tucked away neatly inside the definition of the local variable domain). One can go directly from the formulation in Bertsekas's text ([Ber99a], Eq. 6.44) to BP by substituting the one-norm of x for the convex objective function. A toy version of this master/subproblem loop is sketched below.
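The sketch uses a smooth quadratic objective f_p(x_p) = ½||x_p||² purely so that each subproblem has a trivial closed form; it is a stand-in for the structure of [Ber99a] Eq. 6.44, not the BP objective, and the constant step size is my own safe choice for this smooth toy dual.

```python
# Dual decomposition: minimize sum_p f_p(x_p) s.t. sum_p A_p x_p = y.
# The master holds the dual variable lam; the nodes solve their subproblems
# independently; lam is updated by (sub)gradient ascent on the dual function.
import numpy as np

rng = np.random.default_rng(2)
m, n, P = 50, 250, 10
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n)
blocks = np.array_split(np.arange(n), P)    # column blocks owned by the nodes

lam = np.zeros(m)
step = 1.0 / np.linalg.norm(A, 2) ** 2      # safe for this smooth toy dual
for _ in range(500):
    # node p minimizes 0.5||x_p||^2 - lam^T A_p x_p  =>  x_p = A_p^T lam
    x_parts = [A[:, idx].T @ lam for idx in blocks]
    Ax = sum(A[:, idx] @ xp for idx, xp in zip(blocks, x_parts))
    lam += step * (y - Ax)                  # ascent along the dual (sub)gradient

print("constraint residual:", np.linalg.norm(y - Ax))
```

With the one-norm in place of the quadratic, the subproblem minimizers can be unbounded, which is exactly the complication that motivates Mota's bounded variant discussed later.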
Implementation and parallelization efforts

Although it may seem an obvious motivation to contemporaries of an age where distributed computing is commonplace, Dantzig-Wolfe was in fact not formulated with computational parallelization in mind. Recall that the original goal was the ability to solve LP problems sequentially without requiring constant access to the entire constraint matrix. When storage became less of an issue, people began analyzing the algorithm as a divide-and-conquer attack on the superlinear computational cost of the traditional simplex method. Despite the promising theoretical complexity of decomposition (almost linear for constant block size), empirical studies of its performance on real problems have rarely shown a decomposition algorithm outperforming a straightforward simplex code on a sequential machine [Him73, OH73]. Partly to blame was the lack of standard implementations at a maturity level comparable to those available for standard simplex, such as LP/90/94 and XLDA, without which there was no context for iterative improvement within the optimization community. Early efforts to include decomposition as internal modifications to existing packages never solidified, despite concerted efforts from Dantzig himself

[Dan63]. Eventually, sophisticated LP solvers started to be written in more extensible ways with relatively easy modification hooks, starting with IBM's MPSX package, which for a short while reinvigorated investigations into decomposition methods, but in the end this was not enough to sustain interest long enough to overcome its other problems. By far the biggest issue when evaluating the performance of Dantzig-Wolfe is its poor convergence properties. This has been observed empirically ever since its inception, in every major study of its performance, including some by Dantzig himself. Mainly due to the subgradient method used in the master algorithm, Dantzig-Wolfe has been extremely consistent in generating long tails of almost-correct solutions, at which point the problem either exhibits asymptotic convergence or becomes extremely ill-conditioned. Recommendations not to bother with Dantzig-Wolfe when standard approaches can do the job became the de facto conclusion of most of the survey literature in the 70s and 80s. Simply put, progress in traditional sequential LP solvers was far outpacing any performance advantage that decomposition alone provides. There were some heroic efforts by Ho and Loute [HL81, HL83] in the early 1980s to use sophisticated techniques such as nested decomposition, along with lessons learned from standard LP implementations, to boost the performance of Dantzig-Wolfe, but their studies still concluded that even the most advanced implementations of decomposition suffered from two drawbacks inherent to subgradient methods:

1. frequent asymptotic convergence in the identification of a small subset of the basis, which would not be so problematic for finding approximate solutions were it not for the second issue;

2. an ambiguous stopping criterion in the objective value, partly because subgradient methods are not descent methods, and partly due to the difficulty of distinguishing asymptotic convergence from ill-conditioned cases. This makes it hard to give a meaningful error analysis of proposed intermediate solutions.

The merit of decomposition would have to find its appreciators outside the realm of traditional LP. In the 90s, the advent of interior-point methods gave decomposition new blood, since they are more numerically robust against degenerate problem cases, and to this day most new developments in decomposition are applications to interior-point methods. Another significant contribution of Ho is some of the first comprehensive studies of parallel implementations of Dantzig-Wolfe [HLS88, GH93]. Surprisingly, no significant literature on distributed Dantzig-Wolfe existed prior to 1988, even though

it is arguably one of its most straightforward applications (perhaps explaining the lack of interest in serious studies). Even then, these studies do not provide much insight beyond showing that parallelization gives excellent performance benefits over sequential solvers past some break-even point in problem size. More recent work on both the theoretical performance and the practical architecture of distributed Dantzig-Wolfe can be found in [Ent89, CP99, LLL02]. The main conclusion seems to be that, as long as the convergence issue is dealt with, decomposition can be used very successfully for straightforward parallel implementations. In these studies the communication overhead was kept minimal, since each node works independently on a separable subproblem, communicating to a head node (which solves the master problem) only the current objective value and a vector on the order of the number of global constraints.

Critical analysis of Motas paper

The past few years have seen the rapid popularization of basis pursuit problems, due to the theoretical breakthrough of compressive sensing as well as certain applications in machine learning. It is remarkable to observe the flood of researchers trying to derive basis-pursuit reformulations of problems in their fields, ranging anywhere from astronomy, medical imaging, signal processing, machine learning, seismic analysis, and optics to the more esoteric, such as surface metrology and hyperspectral imaging. I will not re-iterate the importance of basis pursuit to scientists here, but suffice it to say that as more people try to solve BP, the greater the chance that someone somewhere will encounter problems too large for a single computer (I am currently running into this issue in seismic signal processing). Mota's motivation for developing these distributed methods is both obvious and timely. What Mota did fail to sufficiently motivate is why there is a need for a specialized method custom-tailored to BP. Early efforts to solve BP revolved around interior-point type methods, and had those worked well, all the recent work on IPM decomposition would carry directly over to a distributed BP scheme. However, it later became known that using IPM to solve BP often results in solving highly ill-conditioned systems, and it was quickly realized that more specialized methods had to be developed for general BP problems. Completely ignoring this bit of history, Mota went on to describe, in the section titled Related works, how BP can be turned into an LP problem, but erroneously pointed out that existing distributed methods for LP require a fully-connected network topology with poor scaling performance, citing a dissertation that made no reference to previous work on Dantzig-Wolfe. Clearly, from the studies

mentioned in the previous section, an efficient distribution scheme exists via decomposition. I do not know how Mota's complete silence on this body of work is possible, given the seminal status of Ho's work and the fact that the first algorithm Mota proposed in the paper (and the one of interest to this report) is directly based on the dual-decomposition-via-Lagrangian-relaxation approach, which is the generalization of Dantzig-Wolfe. Perhaps more ominously, Mota's neglect of previous work on Dantzig-Wolfe also means that its long history of established shortcomings, especially the convergence issues, is neither mentioned nor taken into account. It is almost inevitable that Mota will somehow rediscover these problems, unless his BP reformulation turns out to be magically immune to them.
Description of Motas algorithm

The formulation of Mota's algorithm, based on a decomposition of bounded basis pursuit, forms the most significant and well-written part of his paper (in the parts pertaining to our scope). Bounded basis pursuit (BBP) is a variant of basis pursuit with a maximum bound R on the absolute value of each of its variables. In my opinion, Mota's motivation for solving BBP instead of BP is actually criminally inadequate in light of how elegantly he approached it in his thesis. Section 2.1 of the thesis beautifully points out how the obvious dual decomposition formulation of BP creates subproblems with an unbounded objective value unless an expensive constraint is put in place. Mota appealed to two ways out of this problem: making the Lagrangian function coercive, or making the solution domain bounded so that the Weierstrass theorem applies. The two algorithms he introduced for BP with horizontal partitioning are a natural consequence of the bifurcation at this point. Making the Lagrangian coercive results in Algorithm 2 of the paper, which is suitable for a pipeline-style distribution scheme, where the nodes have a natural ordering and each node's computation depends on the results produced by the previous ones. Since this scheme is not suitable for traditional parallelization (and is outside our scope of concern), it will not be discussed further here. The result of making the solution domain bounded is the BBP problem, which is equivalent to an instance of BP given a large enough bound R. Mota's main contribution is in recognizing that the dual decomposition of BBP actually has an analytical solution to its subproblems, described in the subsection Solving the dual, computable at a fixed cost (essentially a matrix-vector product with the associated submatrix and its transpose). The implication of this is profound, because each node can now be a dumb, specialized linear algebra engine programmed

in a SIMD (Single Instruction Multiple Data) manner, as opposed to traditional Dantzig-Wolfe, where each node solves its own simplex problem, requiring much more complicated MIMD (Multiple Instruction Multiple Data) programming schemes. Furthermore, the computational cost is now easily analyzed, requiring only knowledge of the number of iterations in the master program. The overall cost of sequential execution is almost equivalent to solving the un-decomposed BBP with standard subgradient methods, with the additional benefit of near-perfect efficiency in nodal scaling (after adjusting for communication overhead). There is, however, an additional overhead in that the BBP approach only attempts to identify the basis for BP, so another (arguably much smaller) BP has to be solved locally on the master node. Since this is a recurring theme in solving BP, no further comment will be made on this point. An obvious question here is how to choose an R larger than the infinity norm of the BP solution, so as to ensure that the solved BBP is equivalent to the original BP. Mota provides an upper bound on R using only A and y, which makes theoretical sense but turns out to be the Achilles' heel of his algorithm. A sketch of my reading of the overall iteration is given below.
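The following sketches the decomposed BBP iteration as I read it from the paper: each node owns a column block of A and solves its subproblem in closed form, while the master performs subgradient ascent on the dual variable. The sign conventions, step-size rule, and the oracle-style choice of R are my assumptions for illustration, not Mota's exact prescriptions.

```python
# Decomposed bounded basis pursuit (BBP): each per-coordinate subproblem
# min_{|t| <= R} |t| - c*t has the closed form t = R*sign(c) if |c| > 1
# else 0, so a node's work per iteration is one product with A_p^T.
import numpy as np

def bbp_subproblem(Ap, lam, R):
    """Closed-form minimizer of ||x_p||_1 - lam^T A_p x_p over ||x_p||_inf <= R."""
    c = Ap.T @ lam
    return np.where(np.abs(c) > 1.0, R * np.sign(c), 0.0)

rng = np.random.default_rng(3)
m, n, P, k = 50, 250, 10, 10
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true
R = 2.0 * np.max(np.abs(x_true))            # oracle-style bound (see next section)
blocks = np.array_split(np.arange(n), P)

lam = np.zeros(m)
x = np.zeros(n)
for it in range(1, 1001):
    for idx in blocks:                      # embarrassingly parallel across nodes
        x[idx] = bbp_subproblem(A[:, idx], lam, R)
    g = y - A @ x                           # subgradient of the dual at lam
    lam += (0.1 / np.sqrt(it)) * g          # diminishing step (my choice)

support = np.abs(x) > 0
print("detected support size:", int(support.sum()))
```

Note that each iterate x takes values only in {0, ±R}, so the output is a support estimate rather than the BP solution itself, matching the need for the final small BP solve on the master node.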
Performance of Motas algorithm

As foreshadowed, Mota reported that his algorithm has serious convergence issues in some cases. Much like its predecessor, Dantzig-Wolfe, Mota's algorithm displays asymptotic convergence in basis identification. While this is quite apparent in the thesis, where Mota devotes a significant portion to identifying trouble spots, in the paper the issue is completely opaque to the reader. Given the limited space of a conference paper this is excusable, but it should at least be noted that this is a significant roadblock to the adoption of Mota's algorithm. As it turns out, the bound R on the solution is both a blessing and a curse. Mota describes how the proximity of R to the infinity norm of the BP solution ||x_BP||_∞ has a profound effect on the convergence of the algorithm. Figure 2.10 of the thesis clearly shows that when as little as a factor of 2 of slack is introduced into the estimate of R (in his experiments he used R = 3 while the actual ||x_BP||_∞ is 1.48), the number of master iterations required to identify a particular support jumps from 3000 to 8000! For the numerical experiments presented in Table 1 of the paper there is no mention of the R used, nor of whether an oracle value was utilized, which casts serious doubts on the validity of the results. This point is also intimately related to my main grievance with Mota's


Figure 1: Evolution of the objective function H(λ_k) over the iterations, using an oracle value of R; see text. The test BBP problem has i.i.d. entries of A pulled from a standard normal distribution, with n = 250 and m = 50. The input signal is a sparse vector with k = 10 nonzero entries pulled i.i.d. from a standard normal distribution. The decomposition is carried out over 10 identically sized blocks along the columns. In this case R is around 3.33. The final solution detected 37 support candidates, with 9 of the 10 true supports in the original signal correctly identified.

thesis. While the attempts at numerical identification of convergence issues are comprehensive and commendable, there is no connection between Mota's proposed way of determining R and what he actually used in the experiments. Mota's proposed upper bound (and the natural choice for uninformed problems) on R is n||x̄||_∞, where x̄ is any solution of Ax = y. Ignoring for a second the fact that generating such an x̄ is itself a significant investment when you are in a situation that requires a distributed scheme in the first place, we still have the issue that the expected value of this R does not match what Mota used in his experiments! Had Mota stuck to his suggestion for his model BP problem of n = 250, where both A and the nonzero components of x are pulled from standard normal distributions, the suggested R = n||x̄||_∞ would be on the order of n, nowhere near the value of 3 that he used! In reality, R = 3 is more of an oracle value than an actual uninformed guess, and Mota never justifies its usage. That there are other serious convergence issues, as outlined in the thesis, even with this choice is troublesome. While harping on this one choice of R might give the impression of nitpicking, the fact is that we have only begun to scratch the surface of the importance of R. A minimal sketch of the uninformed choice is given below.
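For reference, the uninformed bound just described can be computed as follows; taking x̄ to be the minimum-norm least-squares solution of Ax = y is my choice here (and the one used to produce Figure 2 below).

```python
# The uninformed choice R = n * ||x_ls||_inf, with x_ls the minimum-norm
# least-squares solution of Ax = y. Validity: ||x_BP||_inf <= ||x_BP||_1
# <= ||x_ls||_1 <= n * ||x_ls||_inf, since x_BP has minimal one-norm.
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 50, 250, 10
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimum-norm solution, m < n
R = n * np.max(np.abs(x_ls))
print(f"uninformed R = {R:.1f} vs oracle ||x_true||_inf = {np.max(np.abs(x_true)):.2f}")
```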


Figure 2: Evolution of the objective function H(λ_k) over the iterations, using a reasonable uninformed guess of R; see text. The test BBP problem is identical to the one in Figure 1, except for the value of R, which this time is around 61. The final solution detected 36 support candidates, with 9 of the 10 true supports in the original signal correctly identified.

As formulated, the objective value being minimized in Mota's algorithm is

    H(λ_k) = λ_k^T g_k - ||x(λ_k)||_1,        (2)

where λ_k and g_k are only loosely coupled with R (they are just consistency prices from the master algorithm), but the one-norm of the current iterate, ||x(λ_k)||_1, is, due to the analytical solution of the decomposed BBP subproblems, exactly R times the number of supports detected at that iteration. To illustrate the problem, Figure 1 shows a sample evolution of the objective H(λ_k) for a problem size similar to that used in Mota's thesis, with R = 2||x_BP||_∞ representing an oracle-type guess. Figure 2 shows the same evolution of H(λ_k) with R = n||x_LS||_∞, where x_LS is a least-squares solution of Ax = y. Evidently, even with a completely reasonable uninformed guess, one is left with an arguably completely uninformative objective H(λ_k). In this case the best bet for a stopping criterion is the current total detected support (requiring an additional matrix-vector product at each node) coupled with knowledge of the optimal cardinality of the unknown, which for many use cases (mine included) is an unreasonable assumption.
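To spell out why a loose R drowns the objective: every coordinate of the analytic subproblem solution is either 0 or ±R, so, writing S_k for the support detected at iteration k (my notation),

```latex
\|x(\lambda_k)\|_1 = R\,|S_k|
\quad\Longrightarrow\quad
H(\lambda_k) = \lambda_k^T g_k - R\,|S_k|,
```

so an overestimated R scales the dominant term linearly and swamps whatever information the price term carries, which is exactly the difference visible between Figures 1 and 2.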


Review of claims

It is evident that without further work on a better way to arrive at R, Mota's algorithm can only feasibly be applied to a handful of specialized cases. To Mota's credit this is mentioned in the future-work section of the thesis (in fact it is the only thing mentioned there, showing that he understood its importance). As is evident from the lack of numerical comparisons with other BP algorithms, Mota intends this paper as a proof of concept rather than a fully robust description of the algorithm. Since no claims of efficiency or practicality are made in the paper, the convergence issues can be overlooked. Still, my opinion is that the conference paper should at least mention the convergence properties. The rest of Mota's claims are sound: a novel approach to decomposed BP is presented, one which is suitable for distributed computing environments, with very undemanding requirements in both complexity and programmatic sophistication. The main contribution is the bounded-BP reformulation leading to a very interesting decomposition scheme, and it should provide ample opportunity for future work.

Conclusion

While historically troubled, Dantzig-Wolfe keeps getting resurrected and expanded into a vast number of interesting derivatives, and Mota's algorithm is the latest iteration of this fascinating pedigree. With a contemporary and timely focus on tackling large BP problems, Mota's work presents an interesting take on decomposition, one that certainly has the potential to find application in a large range of areas, should some glaring issues with convergence and the choice of R be ironed out.

References
[Ber99a] Dimitri P. Bertsekas. Decomposition methods, chapter 6.4. In [Ber99b], September 1999.

[Ber99b] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, September 1999.

[CP99] Zhi-Long Chen and Warren B. Powell. A column generation based decomposition algorithm for a parallel machine just-in-time scheduling problem. European Journal of Operational Research, 116:220-232, 1999.

[CRT06] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207-1223, 2006.

[Dan63] George B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.

[DW60] George B. Dantzig and Philip Wolfe. Decomposition principle for linear programs. Operations Research, 8:101-111, 1960.

[Ent89] Robert Entriken. The parallel decomposition of linear programs. Ph.D. thesis and technical report SOL 89-17, Systems Optimization Laboratory, Department of Operations Research, Stanford University, November 1989.

[GH93] S. Kingsley Gnanendran and James K. Ho. Load balancing in the parallel optimization of block-angular programs. Mathematical Programming, 62:41-67, 1993.

[Him73] David M. Himmelblau, editor. Decomposition of Large Scale Problems. North-Holland, Amsterdam, 1973.

[HL81] James K. Ho and Etienne Loute. An advanced implementation of the Dantzig-Wolfe decomposition algorithm for linear programming. Mathematical Programming, 20:303-326, 1981.

[HL83] James K. Ho and Etienne Loute. Computational experience with advanced implementation of decomposition algorithms for linear programming. Mathematical Programming, 27:283-290, 1983.

[HLS88] James K. Ho, Tak C. Lee, and R. P. Sundarraj. Decomposition of linear programs using parallel computation. Mathematical Programming, 42:391-405, 1988.

[LLL02] Jung Lyu, Hsing Luh, and Ming-Chang Lee. Performance analysis of a parallel Dantzig-Wolfe decomposition algorithm for linear programming. Computers & Mathematics with Applications, 44:1431-1437, 2002.

[Lus] Irvin Lustig. Optimization trailblazers: Interview with Philip Wolfe. http://www.e-optimization.com/directory/trailblazers/wolfe/decomposition_algorithm.cfm.

[Mot08] João F. C. Mota. Distributed algorithms for sparse approximation. Master's thesis, Technical University of Lisbon, September 2008.

[Mot09] João F. C. Mota. Distributed algorithms for basis pursuit. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, Saint-Malo, France, March 2009. INRIA.

[NW99] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, August 1999.

[OH73] William Orchard-Hays. Practical problems in LP decomposition and a standardized phase I decomposition as a tool for solving large scale problems, pages 153-166. In Himmelblau [Him73], 1973.

[Teb01] James R. Tebboth. A Computational Study of Dantzig-Wolfe Decomposition. PhD thesis, University of Buckingham, December 2001.

