Faster, I say! The race for the fastest SDD linear system solver – STOC 2014 Recaps (Part 9)

In the next post in our series of STOC 2014 recaps, Adrian Vladu tells us about some of the latest and greatest in Laplacian and SDD linear system solvers. There’s been a flurry of exciting results in this line of work, so we hope this gets you up to speed.

The Monday morning session was dominated by a nowadays popular topic, symmetric diagonally dominant (SDD) linear system solvers. Richard Peng started by presenting his work with Dan Spielman, the first parallel solver with near linear work and poly-logarithmic depth! This is exciting, since parallel algorithms are used for large scale problems in scientific computing, so this is a result with great practical applications.

The second talk was given by Jakub Pachocki and Shen Chen Xu from CMU, which was a result of merging two papers. The first result is a new type of trees that can be used as preconditioners. The second one is a more efficient solver, which together with the trees shaved one more $\log n$ factor in the race for the fastest solver.

Before getting into more specific details, it might be a good idea to provide a bit of background on the vast literature of Laplacian solvers.

Typically, linear systems $Ax = b$ are easier to solve whenever $A$ has some structure on it. A particular class we care about are positive semidefinite (PSD) matrices. They work nicely because the solution is the minimizer of the quadratic form $f(x) = \frac{1}{2} x^T A x - b x$ , which happens to be a convex function due to the PSD-ness of $A$. Hence we can use various versions of gradient descent, the convergence of which depends usually on the condition number of $A$.

A subset of PSD matrices are Laplacian matrices, which are nothing else but graph Laplacians; using an easy reduction, one can show that any SDD system can be reduced to a Laplacian system. Laplacians are great because they carry a lot of combinatorial structure. Instead of having to suffer through a lot of scary algebra, this is the place where we finally get to solve some fun problems on graphs. The algorithms we aim for have running time close to linear in the sparsity of the matrix.

One reason why graphs are nice is that we know how to approximate them with other simpler graphs. More specifically, when given a graph $G$, we care about finding a sparser graph $H$ such that $L_H \preceq L_G \preceq \alpha \cdot L_H$a, for some small $\alpha$ (the smaller, the better). The point is that whenever you do gradient descent in order to minimize $f(x)$, you can take large steps by solving a system in the sparser $L_H$. Of course, this requires another linear system solve, only that this only needs to be done on a sparser graph. Applying this idea recursively eventually yields efficient solvers. A lot of combinatorial work is spent on understanding how to compute these sparser graphs.

In their seminal work, Spielman and Teng used ultrasparsifiersb as their underlying combinatorial structure, and after many pages of work they obtained a near linear algorithm with a large polylogarithmic factor in the running time. Eventually, Koutis, Miller and Peng came up with a much cleaner construction, and showed how to construct a chain of sparser and sparser graphs, which yielded a solver that was actually practical. Subsequently, people spent a lot of time trying to shave log factors from the running time, see [9], [6], [7], [12] (the last of which was presented at this STOC), and the list will probably continue.

After this digression, we can get back to the conference program and discuss the results.

How do we solve a system $L_G x = b$? We need to find away to efficiently apply the operator $L_G^+$ to $b$. Even Laplacians are not easy to invert, and what’s worse, their pseudoinverses might not even be sparse. However, we can still represent $L_G^+$ as a product of sparse matrices which are easy to compute.

We can gain some inspiration from trying to numerically approximate the inverse of $(1-x)$ for some small real $x$. Taking the Taylor expansion we get that $(1-x)^{-1} = \sum_{i \geq 0} x^i = \prod_{k \geq 0} (1+x^{2^k})$. Notice that in order to get $\epsilon$ precision, we only need to take the product of the first $\Theta(\log 1/\epsilon)$ factors. It would be great if we could approximate matrix inverses the same way. Actually, we can, since for matrices of norm less than $1$ we have the identity $(I-A)^{-1} = \prod_{k\geq 0}(I + A^{2^k})$. At this point we’d be tempted to think that we’re almost done, since we can just write $L_G = D - A$, and try to invert $I - D^{-1/2} A D^{-1/2}$. However we would still need to compute matrix powers, and those matrices might again not even be sparse, so this approach needs more work.

Richard presented a variation of this idea that is more amenable to SDD matrices. He writes

$(D-A)^{-1} = \frac{1}{2}( D^{-1} + (I+D^{-1}A)(D - AD^{-1}A)^{-1}(I+AD^{-1} ) )$

The only hard part of applying this inverse operator $(D-A)^{-1}$ to a vector consists of left multiplying by $(D-AD^{-1}A)^{-1}$. How to do this? One crucial ingredient is the fact that $M = D-AD^{-1}A$ is also SDD! Therefore we can recurse, and solve a linear system in $M$. You might say that we won’t be able to do it efficiently, since $M$ is not sparse. But with a little bit of work and the help of existing spectral sparsification algorithms $M$ can be approximated with a sparse matrix.

Notice that at the $k^{th}$ level of recursion, the operator we need to apply is $\left(D-(D^{-1}A)^{2^k}\right)^{-1}$. A quick calculation shows that if the condition number of $D-A$ is $\kappa$, then the condition number of $D^{-1}A$ is $1-1/\kappa$. This means that after $k = O(\log \kappa)$ iterations, the eigenvalues of $(D^{-1}A)^{2^k}$ are close to $0$, so we can just approximate the operator with $D^{-1}$ without paying too much for the error.

There are a few details left out. Sparsifying $M$ requires a bit of understanding of its underlying structure. Also, in order to do this in parallel, the authors originally employed the spectral sparsification algorithm of Spielman and Teng, combined with a local clustering algorithm of Orecchia, Sachdeva and Vishnoi. Blackboxing these two sophisticated results might question the practicality of the algorithm. Fortunately, Koutis recently produced a simple self-contained spectral sparsification algorithm, which parallelizes and can replace all the heavy machinery in Richard and Dan’s paper.

Jakub Pachocki and Shen Chen Xu talked about two results, which together yield the fastest SDD system solver to date. The race is still on!

Let me go through a bit of more background. Earlier on I mentioned that graph preconditioners are used to take long steps while doing gradient descent. A dual of gradient descent on the quadratic function is the Richardson iteration. This is yet another iterative method, which refines a coarse approximation to the solution of a linear system. Let $L_G$ be the Laplacian of our given graph, and $L_H$ be the Laplacian of its preconditioner. Let us assume that we have access to the inverse of $L_H$. The Richardson iteration computes a sequence $x_0, x_1, x_2, \dots$, which converges to the solution of the system. It starts with a weak estimate for the solution, and iteratively attempts to decrease the norm of the residue $b- L_G x_k$ by updating the current solution with a coarse approximation to the solution of the system $L_G x = b -L_G x_k$. That coarse approximation is computed using $L_H^{-1}$. Therefore steps are given by

$x_{k+1} = x_k + \alpha \cdot L_H^{-1} ( b - L_G x_k )$

where $\alpha \in (0,1]$ is a parameter that adjusts the length of the step. The better $L_H$ approximates $L_G$, the fewer steps we need to make.

The problem that Jakub and Shen talked about was finding these good preconditioners. The way they do it is by looking more closely at the Richardson iteration, and weakening the requirements. Instead of having the preconditioner approximate $L_G$ spectrally, they only impose some moment bounds. I will not describe them here, but feel free to read the paper. Proving that these moment bounds can be satisfied using a sparser preconditioner than those that have been so far used in the literature constitutes the technical core of the paper.

Just like in the past literature, these preconditioners are obtained by starting with a good tree, and sampling extra edges. Traditionally, people used low stretch spanning trees. The issue with them is that the number of edges in the preconditioner is determined by the average stretch of the tree, and we can easily check that for the square grid this is $\Omega(\log n)$. Unfortunately, in general we don’t know how to achieve this bound yet; the best known result is off by a $\log \log n$ factor. It turns out that we can still get preconditioners by looking at a different quantity, called the $\ell_p$ stretch ($0 < p < 1$), which can be brought down to $O(\log^p n)$. This essentially eliminates the need for computing optimal low stretch spanning trees. Furthermore, these trees can be computed really fast, $O(m \log \log n)$ time in the RAM model, and the algorithm parallelizes.

This result consists of a careful combination of existing algorithms on low stretch embeddings of graphs into tree metrics and low stretch spanning trees. I will talk more about these embedding results in a future post.

a. $\preceq$ is also known as the Lowner partial order. $A \preceq B$ is equivalent to $0 \preceq B-A$, which says that $B-A$ is PSD.
b. A $(k, h)$-ultrasparsifier $H$ of $G$ is a graph with $n-k+1$ edges such that $L_H \preceq L_G \preceq h \cdot L_H$. It turns out that one is able to efficiently construct $(k, n/k\ \mathrm{polylog}(n))$ ultrasparsifiers. So by adding a few edges to a spanning tree, you can drastically reduce the relative condition number with the initial graph.

[1] Approaching optimality for solving SDD systems, Koutis, Miller, Peng (2010).
[2] Nearly-Linear Time Algorithms for Graph Partitioning, Graph Sparsification, and Solving Linear Systems, Spielman, Teng (2004).
[3] A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning, Spielman, Teng (2013).
[4] Spectral Sparsification of Graphs, Spielman, Teng (2011).
[5] Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems, Spielman, Teng (2014).
[6] A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time, Kelner, Orecchia, Sidford, Zhu (2013).
[7] Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, Lee, Sidford (2013).
[8] A nearly-mlogn time solver for SDD linear systems, Koutis, Miller, Peng (2011).
[9] Improved Spectral Sparsification and Numerical Algorithms for SDD Matrices, Koutis, Levin, Peng (2012).
[10] Simple parallel and distributed algorithms for spectral graph sparsification, Koutis (2014).
[11] Preconditioning in Expectation, Cohen, Kyng, Pachocki, Peng, Rao (2014).
[12] Stretching Stretch, Cohen, Miller, Pachocki, Peng, Xu (2014).
[13] Solving SDD Linear Systems in Nearly mlog1/2n Time, Cohen, Kyng, Miller, Pachocki, Peng, Rao, Xu (2014).
[14] Approximating the Exponential, the Lanczos Method and an $\tilde{O}(m)$-Time Spectral Algorithm for Balanced Separator, Orecchia, Sachdeva, Vishoni (2012).