
HPC Nyström

maurice.gauche
November 2023

Throughout this project I kept repeating one error: writing "Nylstrom" instead of "Nyström".

In the zip file the following can be found:

- CGS for Nylstrom.py: an implementation of classical Gram-Schmidt adapted to my implementation of the Nyström algorithm.
- DataGenerationProject2.py: functions for building artificial data.
- Nylström Algo Parallel.py: my parallelized algorithm.
- Simplethings.py: some functions, in particular the error function.
- SRHTBlock.py: a function that builds the sketching matrix in parallel.
- UsingProcessor square.py: functions for distributing and gathering the A matrix over a square grid of processors.
- The figure files: the code I used for my plots.
- mainproject2.py: a simple script that can be run in parallel to test my implementation of the algorithm.

1 Presentation of the randomized Nyström algorithm


Suppose we are given a symmetric positive semidefinite matrix A ∈ R^{n×n} and a sketching matrix Ω ∈ R^{n×l}. We then use the following algorithm (the same as the one given in the lecture of 31 October [1], except that we replaced the Cholesky decomposition by an eigenvalue decomposition):

Algorithm 1 Nyström low rank approximation algorithm

Require: A ∈ R^{n×n} (SPSD), and Ω ∈ R^{n×l}
  Compute C = AΩ
  Compute B = Ω⊤C and its eigenvalue decomposition B = S∆S⊤ = S∆^{1/2}∆^{1/2}S⊤
  Let L = S∆^{1/2} such that B = LL⊤
  Compute Z = CS(∆^{1/2})⁺ = CL^{−⊤}   ▷ L^{−⊤} might be ill defined
  Compute the QR factorization QR = Z
  Compute the truncated rank-k SVD of R as U_kΣ_kV_k⊤
  Compute Û_k = QU_k
  Output [[A_Nyst]]_k = Û_kΣ_k²Û_k⊤

This algorithm is motivated by the following equalities:

A_Nyst = (AΩ)(Ω⊤AΩ)⁺(Ω⊤A)
       = CB⁺C⊤
       = C(S∆S⊤)⁺C⊤
       = C(S∆⁺S⊤)C⊤
       = CS(∆⁺)^{1/2}(∆⁺)^{1/2}S⊤C⊤
       = ZZ⊤ = QRR⊤Q⊤
       = QUΣΣU⊤Q⊤ = ÛΣ²Û⊤

Remark: here we take R = UΣV⊤ to be the full (not truncated) SVD of R, and Û = QU is orthogonal.

This shows that the SVD of A_Nyst is ÛΣ²Û⊤. This is why the rank-k truncated approximation of A_Nyst coincides with Û_kΣ_k²Û_k⊤ as defined in Algorithm 1.
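To make the procedure concrete, here is a minimal numpy sketch of Algorithm 1. It is a sketch under stated assumptions, not the code from the zip file: the function name and the threshold eps used to pseudo-invert ∆^{1/2} are hypothetical, the latter a safeguard for the case where L^{−⊤} is ill defined.

```python
import numpy as np

def nystrom_low_rank(A, Omega, k, eps=1e-12):
    """Rank-k Nystrom approximation of an SPSD matrix A (sketch of Algorithm 1).

    Returns (U_hat_k, sig2_k) with A ~ U_hat_k @ np.diag(sig2_k) @ U_hat_k.T.
    `eps` is a hypothetical threshold for pseudo-inverting Delta^{1/2}.
    """
    C = A @ Omega                              # C = A Omega
    B = Omega.T @ C                            # B = Omega^T A Omega
    delta, S = np.linalg.eigh(B)               # B = S Delta S^T (B is SPSD)
    # (Delta^{1/2})^+ : invert the square roots of the nonzero eigenvalues only.
    inv_sqrt = np.where(delta > eps, 1.0 / np.sqrt(np.maximum(delta, eps)), 0.0)
    Z = C @ (S * inv_sqrt)                     # Z = C S (Delta^{1/2})^+
    Q, R = np.linalg.qr(Z)                     # thin QR factorization of Z
    U, sigma, _ = np.linalg.svd(R)             # SVD of the small l x l factor R
    U_hat_k = Q @ U[:, :k]                     # U_hat_k = Q U_k
    return U_hat_k, sigma[:k] ** 2             # [[A_Nyst]]_k = U_hat_k Sigma_k^2 U_hat_k^T
```

Replacing the eigenvalue decomposition of B by a Cholesky factorization recovers the lecture version [1].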

2 Choice of the sketching matrix


As recommended, we use the block SRHT to generate the sketching matrix Ω ∈ R^{n×l}:

Ω⊤ = [Ω_(1)⊤ Ω_(2)⊤ ... Ω_(P)⊤]   (1)

where Ω_(i)⊤ = √(n/(Pl)) D_Li R H D_Ri, with:

• D_Li ∈ R^{l×l} and D_Ri ∈ R^{(n/P)×(n/P)} are diagonal matrices with independent random signs.
• H ∈ R^{(n/P)×(n/P)} is a normalized Walsh-Hadamard matrix.
• R ∈ R^{l×(n/P)} is a uniform sampling matrix.

The choice of the block Walsh-Hadamard matrix is based on the following result, seen in the lecture of 24 October [2]:
Ω as given in (1) is an OSE(m, ϵ, δ) when l = O(ϵ^{−2}(m + ln(n/δ)) ln(m/δ)).
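As an illustration, here is a minimal dense sketch of one block Ω_(i)⊤. The function name and the choice of sampling without replacement are assumptions, and SRHTBlock.py presumably applies H through a fast Walsh-Hadamard transform rather than forming it explicitly.

```python
import numpy as np
from scipy.linalg import hadamard

def srht_block(n, l, P, seed):
    """One block Omega_(i)^T in R^{l x (n/P)} of the block SRHT of equation (1).

    Assumes n/P is a power of two so the Walsh-Hadamard matrix exists.
    Dense and for illustration only; a fast transform avoids forming H.
    """
    m = n // P
    rng = np.random.default_rng(seed)
    d_L = rng.choice([-1.0, 1.0], size=l)        # diagonal of D_Li (random signs)
    d_R = rng.choice([-1.0, 1.0], size=m)        # diagonal of D_Ri (random signs)
    H = hadamard(m) / np.sqrt(m)                 # normalized Walsh-Hadamard matrix
    rows = rng.choice(m, size=l, replace=False)  # uniform sampling matrix R
    # Omega_(i)^T = sqrt(n/(P*l)) * D_Li @ R @ H @ D_Ri
    return np.sqrt(n / (P * l)) * (d_L[:, None] * H[rows, :] * d_R[None, :])
```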

3 Numerical stability
We run our tests with n = 1024. Two families of data matrices, as given in [3], are used to test our algorithm: the polynomial decay matrices and the exponential decay matrices, with parameters R, q and p.
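For reference, both families can be generated as diagonal matrices following the definitions in [3]. This is a sketch of what DataGenerationProject2.py presumably does; the exact parameterization there may differ.

```python
import numpy as np

def poly_decay(n, R=10, p=1.0):
    """Polynomial decay matrix from [3]: R unit eigenvalues, then 2^-p, 3^-p, ..."""
    d = np.ones(n)
    d[R:] = np.arange(2, n - R + 2, dtype=float) ** (-p)
    return np.diag(d)

def exp_decay(n, R=10, q=0.25):
    """Exponential decay matrix from [3]: R unit eigenvalues, then 10^{-i q}."""
    d = np.ones(n)
    d[R:] = 10.0 ** (-q * np.arange(1, n - R + 1))
    return np.diag(d)
```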

We first consider the Nyström approximation without rank-k truncation (we force k = l) and obtain the plots in Figure 1 and Figure 2. These show the relative error, with regard to the nuclear norm, of the Nyström low rank approximation (solid lines) and of the rank-k truncation (dashed lines).
Figure 1 shows that for polynomial decay the relative error of the Nyström algorithm with SRHT sketching matrices follows, on a log scale, the general direction of the rank-k truncation.
Figure 2 shows that for a 'slow' exponential decay matrix the Nyström approximation behaves badly compared to the rank-k truncated approximation when the exponential parameter q is small (relative to 1).

Figure 1: Relative error with regard to the nuclear norm for the Nyström approximation (solid lines) and rank-k truncation (dashed lines), for polynomial decay matrices

Figure 2: Relative error with regard to the nuclear norm for the Nyström approximation (solid lines) and rank-k truncation (dashed lines), for exponential decay matrices

Figures 1 and 2 show that there is room for improvement if we choose some k, and we expect this gap to close as l increases. For this we compute a different relative error, which we call the rank-k relative error:

∥A − [[A_Nyst]]_k∥∗ / ∥A − A_k∥∗ − 1

We fix k = 10 and n = 1024; the plots are given in Figures 3 and 4. These plots show that increasing l makes our algorithm more accurate. In particular, for small values of l (for example 50) we already achieve a good approximation, with a relative error of less than 10^{−1}.
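For clarity, the rank-k relative error above can be computed as follows. This is a minimal sketch; the error function lives in Simplethings.py and may be implemented differently.

```python
import numpy as np

def k_rank_relative_error(A, A_nyst_k, k):
    """Nuclear-norm relative error of [[A_Nyst]]_k against the best rank-k A_k."""
    U, s, Vt = np.linalg.svd(A)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # best rank-k truncation of A
    err = np.linalg.norm(A - A_nyst_k, 'nuc') / np.linalg.norm(A - A_k, 'nuc')
    return err - 1.0
```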
From the accuracy bound for the Nyström algorithm from lecture [1] we get that if l = O(k (log(n/δ))²), then ∥A − [[A_Nyst]]_k∥∗ ≤ 4 ∥A − [[A]]_k∥∗ holds with probability at least 1 − 2δ. We tried to investigate theoretically the constant behind l = O(k (log(n/δ))²) by reading paper [4]. Yet it would seem that computing l with Theorem 2.1 of [4], such that Ω is an OSE(1/3, δ, k) and an OSE(ϵ/d, δ/N, 1), is a problem for 'small' values of n (that is, of order less than 10^5), since then l ≥ √n is not usable.
Still, we use l = O(k (log(n/δ))²) by trying different constants c such that l = c·k·(log(n/δ))². From Figures 3 and 4 we can estimate c by taking 50 = l = c·k·(log(n/δ))²; we obtain c ≈ 20.

Figure 3: Plot of ∥A − [[A_Nyst]]_k∥∗ / ∥A − A_k∥∗ − 1 for k = 10, n = 1024 (polynomial decay matrices)

Figure 4: Plot of ∥A − [[A_Nyst]]_k∥∗ / ∥A − A_k∥∗ − 1 for k = 10, n = 1024 (exponential decay matrices)

4 Parallelization of the Nyström low rank approximation with SRHT sketching matrix

We present Algorithm 2, which is a parallelized version of Algorithm 1. We have P = √P × √P processors. This algorithm essentially does the following:
- Computes C and B in parallel.
- Computes Z locally, that is Z_i for i ∈ {1, ..., √P} (the Z_i form a block row distribution).
- Computes a parallelized QR decomposition of Z.
- Computes the truncated rank-k SVD of R.
- Computes Û_k locally.

Algorithm 2 Parallelized Nyström low rank approximation algorithm

Require: A ∈ R^{n×n} (SPSD), and Ω ∈ R^{n×l}
  ▷ Computation of C = AΩ and B = Ω⊤C
  Distribute A among the processors using a two-dimensional block distribution.
  Distribute Ω among the processors using a row distribution.
  The processor P_ij owns the block A_ij.
  For all processors P_ij, i = 1 to √P, j = 1 to √P, in parallel do
    Compute C_ij = A_ij Ω_j
    Sum-reduce to compute C_i = Σ_{j=1}^{√P} C_ij among the processors of a same row
    Compute B_i = Ω_i⊤ C_i
  End For
  Sum-reduce to compute B = Σ_{i=1}^{√P} B_i
  Compute the eigenvalue decomposition B = S∆S⊤ = S∆^{1/2}∆^{1/2}S⊤
  Let L = S∆^{1/2} such that B = LL⊤
  ▷ From here we work on the column of processors that own the C_i (the P_i1)
  Compute Z locally on each processor, that is Z_i = C_i S(∆^{1/2})⁺ = C_i L^{−⊤}   ▷ L^{−⊤} might be ill defined
  Compute a parallelized QR decomposition of Z, which gives Q_i R = Z_i
  Compute the truncated rank-k SVD of R as U_k Σ_k V_k⊤
  Compute Û_k = QU_k locally as (Û_k)_i = Q_i U_k
  For i = 1 to √P, j = 1 to √P
    Output ([[A_Nyst]]_k)_ij = (Û_k)_i Σ_k² ((Û_k)_j)⊤   ▷ Each processor outputs a block of [[A_Nyst]]_k
  End For
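The following mpi4py sketch illustrates the communication pattern of Algorithm 2 up to the output step. It is a sketch under assumptions: the function name and argument layout are invented, and the parallel QR is done here by CholeskyQR to keep the code short, whereas the project uses a classical Gram-Schmidt (CGS for Nylstrom.py).

```python
import numpy as np
from mpi4py import MPI

def parallel_nystrom(A_ij, Omega_i, Omega_j, k, comm=MPI.COMM_WORLD):
    """Sketch of Algorithm 2 on a sqrt(P) x sqrt(P) processor grid (0-indexed)."""
    q = int(np.sqrt(comm.Get_size()))          # grid dimension sqrt(P)
    i, j = divmod(comm.Get_rank(), q)          # coordinates of processor P_ij
    row = comm.Split(color=i, key=j)           # communicator of row i
    col = comm.Split(color=j, key=i)           # communicator of column j

    # C_ij = A_ij Omega_j, sum-reduced across each row onto its first processor.
    C_ij = A_ij @ Omega_j
    C_i = np.empty_like(C_ij) if j == 0 else None
    row.Reduce(C_ij, C_i, op=MPI.SUM, root=0)
    if j != 0:
        return None                            # only the first column continues

    # B = sum_i Omega_i^T C_i, replicated on the whole first column.
    B_i = Omega_i.T @ C_i
    B = np.empty_like(B_i)
    col.Allreduce(B_i, B, op=MPI.SUM)

    # Z_i = C_i S (Delta^{1/2})^+, computed redundantly from the small matrix B.
    delta, S = np.linalg.eigh(B)
    inv_sqrt = np.where(delta > 1e-12, 1.0 / np.sqrt(np.maximum(delta, 1e-12)), 0.0)
    Z_i = C_i @ (S * inv_sqrt)

    # Parallel QR of the block-row-distributed Z via CholeskyQR (assumes Z has
    # full column rank): G = Z^T Z, R = chol(G)^T, Q_i = Z_i R^{-1}.
    G = np.empty((Z_i.shape[1], Z_i.shape[1]))
    col.Allreduce(Z_i.T @ Z_i, G, op=MPI.SUM)
    R = np.linalg.cholesky(G).T
    Q_i = np.linalg.solve(R.T, Z_i.T).T

    # Truncated rank-k SVD of the small replicated R, then (U_hat_k)_i = Q_i U_k.
    U, sigma, _ = np.linalg.svd(R)
    return Q_i @ U[:, :k], sigma[:k] ** 2      # block row of U_hat_k, Sigma_k^2
```

Such a routine would be launched under MPI, for example with mpiexec -n 4 python mainproject2.py. Note that CholeskyQR inherits the weakness flagged in the algorithm: if B, and hence Z, is numerically rank deficient, the Cholesky factorization can fail.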

5 Presentation of the runtime of the randomized Nyström low rank approximation (without parallelization)
We plot the runtime of our algorithm for R = 10, p = 1 and n = 1024, 4096 (R and p play no role in the runtime) in Figure 5. We also plot a line expected to be constant: the running time of the rank-k truncated approximation.

Figure 5: Plot of the running time for different values of k

Clearly, if the runtime of our algorithm is larger than that of the simple rank-k truncation, then there is no advantage in using our algorithm, since it would be both less precise and slower. Our result seems independent of k; this probably arises because my Python implementation of the rank-k truncation is not optimal, since I first use the numpy.linalg.svd() function and then reduce the rank to k, as sketched below.
The point where the running-time line of our algorithm crosses that of the rank-k truncation increases, as a proportion of n, when n grows, while k (log(n/δ))² decreases as a proportion of n. We can therefore take l = c·k·(log(n/δ))².
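For reference, here is the baseline just described, written exactly as a full SVD followed by truncation (the function name is illustrative).

```python
import numpy as np

def rank_k_truncation(A, k):
    """Baseline of the timing comparison: full SVD, then keep the top k terms.

    The svd call dominates the cost whatever k is, which explains the flat
    baseline; a truncated solver would make the baseline k-dependent.
    """
    U, s, Vt = np.linalg.svd(A)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]      # best rank-k approximation A_k
```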

6 Presentation of parallelization runtime


While investigating the runtime with parallelization, I noticed that the initial distribution of the A matrix across the processors is very costly, as is the final gathering of A; a sketch of this distribution step follows.
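As a rough illustration of where this overhead comes from, here is a minimal sketch of such a 2D block distribution (names and layout are assumptions; UsingProcessor square.py may proceed differently). The root has to move O(n²) data before any computation starts, and the final gathering of [[A_Nyst]]_k is the mirror image with comm.gather, at the same cost.

```python
import numpy as np
from mpi4py import MPI

def distribute_blocks(A, comm=MPI.COMM_WORLD, root=0):
    """Scatter A as a sqrt(P) x sqrt(P) grid of contiguous blocks (illustrative)."""
    q = int(np.sqrt(comm.Get_size()))
    blocks = None
    if comm.Get_rank() == root:
        b = A.shape[0] // q                    # block size n / sqrt(P)
        # Block (r, c) goes to the processor of rank r * q + c.
        blocks = [A[r*b:(r+1)*b, c*b:(c+1)*b].copy()
                  for r in range(q) for c in range(q)]
    return comm.scatter(blocks, root=root)     # each rank receives its A_ij
```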
Speedup only occurred for very large n, such as n = 2^12; I could not test larger values. Here are our results.

References
[1] Laura Grigori. Randomized algorithms for low rank matrix approximation. Lecture of 31 October 2023.
[2] Laura Grigori. Introduction to randomization and sketching techniques. Lecture of 24 October 2023.
[3] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Fixed-rank approximation of a positive-semidefinite matrix from streaming data. Advances in Neural Information Processing Systems, 2017.
[4] O. Balabanov, M. Beaupère, L. Grigori, and V. Lederer. Block subsampled randomized Hadamard transform for low-rank approximation on distributed architectures, 2022.
