Hierarchical matrices (abbreviated as $\mathcal{H}$-matrices) combine a tree-structured partitioning of a matrix into blocks with low-rank approximations of selected blocks, allowing certain dense matrices, such as those arising in the discretization of integral and partial differential equations, to be stored and operated with at almost linear cost.
Over the last years, there have been a number of research efforts to develop mature libraries of linear algebra operations for $\mathcal{H}$-matrices.
In this paper we pursue the development of a prototype task-parallel algorithm for the solution of hierarchical symmetric positive definite (SPD) linear systems via the $\mathcal{H}$-Cholesky factorization.
In [2] we presented a prototype parallel implementation of the $\mathcal{H}$-LU factorization that extracted task-parallelism via the OmpSs programming model.
The OmpSs programming model has been applied to extract task-parallelism for the LU factorization of dense matrices in [3], as well as for other dense linear algebra operations.
The rest of the paper is structured as follows. In Section 2 we provide a short review of the mathematical foundations of $\mathcal{H}$-matrices. In Section 3 we review the Cholesky factorization and derive its hierarchical counterpart. In Section 4 we describe three approaches for the parallel computation of this factorization, and in Section 5 we evaluate them experimentally. Finally, we close the paper with some concluding remarks and a discussion of future work.
In this section, we briefly review a few basic concepts related to $\mathcal{H}$-matrices.
Let $I$ be a finite index set (for example, $I = \{0, 1, \ldots, n-1\}$, corresponding to the rows/columns of an $n \times n$ matrix). A cluster tree $T_I$ is a tree whose root is $I$ and in which the sons of every inner node form a disjoint partition of that node. Let $T_{I \times I}$ denote a block cluster tree built from $T_I$, with nodes (blocks)
$$ b = s \times t, \qquad s, t \in T_I, $$
where a block is subdivided into the products of the sons of $s$ and $t$ unless it satisfies a prescribed admissibility condition (or is small enough to be stored directly).
In order to obtain the hierarchy of blocks that defines the $\mathcal{H}$-matrix, the blocks of the block cluster tree are refined recursively, and the recursion stops at the admissible blocks and at those whose dimension falls below a given threshold.
Figure 1 shows an example of a block cluster tree consisting of leaves that form a partition of $I \times I$.
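As a minimal illustration (a hypothetical tree, not necessarily that of Figure 1), consider a binary cluster tree over the index set $I = \{0, 1, \ldots, 11\}$:
$$ \{0,\ldots,11\} \;\rightarrow\; \{0,\ldots,5\},\ \{6,\ldots,11\} \;\rightarrow\; \{0,1,2\},\ \{3,4,5\},\ \{6,7,8\},\ \{9,10,11\}. $$
The block cluster tree pairs clusters of the same level: a block such as $\{0,\ldots,5\} \times \{6,\ldots,11\}$ becomes a leaf if it satisfies the admissibility condition, whereas non-admissible blocks are subdivided further, down to $3 \times 3$ leaf blocks such as $\{0,1,2\} \times \{3,4,5\}$.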
The collection of low-rank matrices with maximal rank $k$ is defined as
$$ \mathcal{R}(k, n, m) = \{ M \in \mathbb{R}^{n \times m} : \operatorname{rank}(M) \le k \}. $$
The set of $\mathcal{H}$-matrices induced by a block cluster tree $T_{I \times I}$ with maximal rank $k$ is then given by
$$ \mathcal{H}(T_{I \times I}, k) = \{ M \in \mathbb{R}^{I \times I} : \operatorname{rank}(M|_b) \le k \ \text{for all admissible leaves } b \text{ of } T_{I \times I} \}, $$
where $M|_b$ denotes the restriction of $M$ to the block $b$.
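In practice, an admissible block $M|_b \in \mathcal{R}(k, n, m)$ is stored in factorized form $M|_b \approx U V^T$, with $U \in \mathbb{R}^{n \times k}$ and $V \in \mathbb{R}^{m \times k}$, which requires $k(n+m)$ instead of $nm$ entries (e.g., 20,000 instead of 1,000,000 entries for $n = m = 1{,}000$ and $k = 10$). A minimal sketch of such a container in C (all names hypothetical):

    #include <stdlib.h>

    /* Hypothetical storage for an n x m block kept in low-rank form:
     * the block is represented (approximately) as U * V^T, where
     * U is n x k and V is m x k, both in column-major order.        */
    typedef struct {
        int n, m, k;   /* block dimensions and (maximal) rank        */
        double *u;     /* n * k entries of U                         */
        double *v;     /* m * k entries of V                         */
    } lowrank_block_t;

    /* Storage cost: k*(n+m) doubles instead of n*m for a dense block. */
    lowrank_block_t *lowrank_alloc(int n, int m, int k)
    {
        lowrank_block_t *blk = malloc(sizeof *blk);
        blk->n = n; blk->m = m; blk->k = k;
        blk->u = calloc((size_t)n * k, sizeof(double));
        blk->v = calloc((size_t)m * k, sizeof(double));
        return blk;
    }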
Continuing with the example in Figure 1, the block cluster tree defined in that example yields the partitioning of a matrix of size $12 \times 12$.
The abstract concept underlying $\mathcal{H}$-matrices is thus the approximation of the admissible blocks by low-rank factorized representations, while the remaining (small) leaf blocks are stored as conventional dense matrices.
In addition, $\mathcal{H}$-arithmetic makes it possible to approximately add, multiply, invert and factorize matrices directly in this compressed representation, at a cost that grows almost linearly with the problem size.
In this section we initially provide a definition of the Cholesky factorization. Next, we present a blocked algorithm that will later be evolved into a procedure for the decomposition of hierarchical matrices.
The Cholesky factorization of an SPD matrix $A \in \mathbb{R}^{n \times n}$ decomposes it as
$$ A = L L^T, $$
where $L \in \mathbb{R}^{n \times n}$ is lower triangular (with positive diagonal entries, which makes the factorization unique).
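For example,
$$ A = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 0 & \sqrt{2} \end{pmatrix} = L L^T. $$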
There exist three algorithmic variants (procedures) to compute the Cholesky factorization [5]. We next review the blocked right-looking (BRL) procedure, which underlies the scheme to compute the $\mathcal{H}$-Cholesky factorization presented later in this section.
Let us assume that the SPD matrix $A$ is partitioned into square blocks of order $b$ each and, at a given iteration, repartitioned as
$$ A = \begin{pmatrix} A_{1,1} & \ast \\ A_{2,1} & A_{2,2} \end{pmatrix}, $$
where $A_{1,1}$ is the current $b \times b$ diagonal block, $A_{2,1}$ comprises the blocks below it, and $A_{2,2}$ is the trailing sub-matrix. The BRL procedure first computes the Cholesky factorization of the diagonal block, $A_{1,1} = L_{1,1} L_{1,1}^T$. Next, it solves the triangular system (with multiple right-hand sides)
$$ L_{2,1} L_{1,1}^T = A_{2,1} $$
for $L_{2,1}$, and then it performs the update of the trailing sub-matrix:
$$ A_{2,2} := A_{2,2} - L_{2,1} L_{2,1}^T. $$
Note that, because of the symmetry of the matrix, it is not necessary to compute the first row of blocks of the repartitioned matrix (marked with $\ast$ above), since it equals $A_{2,1}^T$; only the lower triangular part of $A$ is stored and referenced.
An inspection of the operations that compose the BRL procedure exposes a collection of data dependencies between the blocks of the matrix, which constrain the order in which the block operations can be correctly executed.
The BRL algorithm for the Cholesky factorization specifies a unique ordering of the operations, different from that dictated by the two other algorithmic variants for this factorization [5]. However, this schedule can be modified, provided the data dependencies are fulfilled, while still computing the same result. For example, the solution of the triangular systems for the different blocks of $L_{2,1}$ involves independent operations, which can therefore be computed in any order, or even concurrently.
Algorithm 1 BRL algorithm for the Cholesky factorization.
Require: SPD matrix $A$, partitioned into $d \times d$ square blocks $A_{i,j}$ of order $b$ each
 1: for $k = 1, 2, \ldots, d$ do
 2:    $A_{k,k} := \mathrm{Chol}(A_{k,k})$   {Cholesky factorization of the diagonal block}
 3:    for $i = k+1, \ldots, d$ do
 4:        $A_{i,k} := A_{i,k} \, A_{k,k}^{-T}$   {triangular system solve}
 5:    end for
 6:    for $i = k+1, \ldots, d$ do
 7:        for $j = k+1, \ldots, i$ do
 8:            $A_{i,j} := A_{i,j} - A_{i,k} \, A_{j,k}^{T}$   {SYRK if $i = j$; GEMM otherwise}
 9:        end for
10:    end for
11: end for
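For illustration purposes, the following sketch realizes Algorithm 1 in C for a plain dense matrix stored in column-major order (rather than an $\mathcal{H}$-matrix), on top of the standard LAPACK/BLAS kernels dpotrf, dtrsm, dsyrk and dgemm. The routine name chol_brl is ours, error checking is omitted, and $n$ is assumed to be a multiple of the block size $b$:

    #include <cblas.h>
    #include <lapacke.h>

    /* Sequential BRL Cholesky factorization (a sketch of Algorithm 1)
     * for a dense n x n SPD matrix A in column-major order.  On exit,
     * the lower triangle of A is overwritten with the factor L.      */
    void chol_brl(int n, int b, double *A)
    {
        for (int k = 0; k < n; k += b) {
            double *Akk = A + (size_t)k * n + k;      /* block A(k,k) */
            /* Line 2: Cholesky factorization of the diagonal block   */
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, Akk, n);
            /* Lines 3-5: A(i,k) := A(i,k) * A(k,k)^{-T}              */
            for (int i = k + b; i < n; i += b)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit, b, b, 1.0,
                            Akk, n, A + (size_t)k * n + i, n);
            /* Lines 6-10: A(i,j) := A(i,j) - A(i,k) * A(j,k)^T       */
            for (int i = k + b; i < n; i += b)
                for (int j = k + b; j <= i; j += b) {
                    if (i == j)   /* symmetric rank-b update          */
                        cblas_dsyrk(CblasColMajor, CblasLower,
                                    CblasNoTrans, b, b, -1.0,
                                    A + (size_t)k * n + i, n,
                                    1.0, A + (size_t)j * n + i, n);
                    else          /* general block multiplication     */
                        cblas_dgemm(CblasColMajor, CblasNoTrans,
                                    CblasTrans, b, b, b, -1.0,
                                    A + (size_t)k * n + i, n,
                                    A + (size_t)k * n + j, n,
                                    1.0, A + (size_t)j * n + i, n);
                }
        }
    }

Each kernel operates on a $b \times b$ block addressed inside the original $n \times n$ array, so no data is copied.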
We next present a procedure for the $\mathcal{H}$-Cholesky factorization. Applying the BRL scheme recursively, according to the nested block structure of the example $\mathcal{H}$-matrix introduced in Section 2, yields the following sequence of operations.
Sequence of operations O1-O10 for the $\mathcal{H}$-Cholesky factorization of the example matrix.
The operations in this sequence correspond to three basic linear algebra building blocks (or computational kernels): Cholesky factorization (O1, O4, O7, O10); triangular system solve with a transposed lower triangular factor (O2, O5, O8); and matrix-matrix multiplication, which can be specialized in the form of symmetric rank-$k$ updates (O3, O6, O9).
These kernels are also present in the BRL Cholesky factorization procedure introduced in Algorithm 1. The only difference is that some of the kernels in the $\mathcal{H}$-Cholesky factorization operate on blocks stored in compressed (low-rank) form and therefore require specialized $\mathcal{H}$-arithmetic implementations.
Figure 4 provides a graphical representation of the five groups of operations
$$ O1\!-\!O3 \;|\; O4 \;|\; O5\!-\!O6 \;|\; O7\!-\!O9 \;|\; O10 $$
that appeared separated by horizontal lines in the sequence of operations for the $\mathcal{H}$-Cholesky factorization.
We note that if any of the matrix blocks of the $\mathcal{H}$-matrix is null, the operations involving that block can be skipped, which reduces the actual cost of the factorization.
In this section we describe three approaches for the parallel computation of the $\mathcal{H}$-Cholesky factorization on a multicore architecture: extracting the parallelism within a multi-threaded BLAS (subsection 4.1), exploiting loop-parallelism (subsection 4.2), and exploiting task-parallelism (subsection 4.3).
The BRL algorithm decomposes the Cholesky factorization into a collection of four types of basic computational kernels operating on the matrix blocks: Cholesky factorization (of a diagonal block), triangular system solve, symmetric rank-$b$ update, and general matrix-matrix multiplication. These kernels map directly to routines from LAPACK (potrf) and the BLAS (trsm, syrk, gemm), so a first approach to extract parallelism is simply to link the sequential code with a multi-threaded instance of these libraries.
This first approach, based on a multi-threaded BLAS and transparent to the programmer (as it does not require any change in the original code), will only deliver a substantial fraction of the machine peak performance if the matrix blocks involved in the operations performed inside the BLAS are sufficiently large with respect to the number of cores of the platform. Unfortunately, for hierarchical matrices, many of the blocks exhibit a reduced dimension, constraining the parallel efficiency of this simple approach.
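For example, under this first approach the driver remains the sequential routine sketched earlier, and only the link line (and, optionally, the number of BLAS threads) changes. A sketch assuming Intel MKL (the wrapper below is hypothetical; mkl_set_num_threads is the actual MKL service call):

    #include <mkl.h>

    void chol_brl(int n, int b, double *A);  /* from the sketch above */

    /* The factorization code is unchanged; the parallelism lives
     * inside each BLAS/LAPACK call of the threaded MKL instance
     * that the binary is linked against.                             */
    void factorize_mt_blas(int n, int b, double *A)
    {
        mkl_set_num_threads(12);  /* or: export MKL_NUM_THREADS=12    */
        chol_brl(n, b, A);        /* sequential loops, parallel kernels */
    }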
The BRL algorithm consists of three nested loops, indexed by variables $k$, $i$ and $j$; see Algorithm 1. A second approach therefore extracts loop-parallelism from the two inner block loops. The first case, corresponding to the loop in lines 3-5, is straightforward: its iterations are independent triangular solves that can be distributed among the threads by simply annotating the loop with an OpenMP parallel for directive.
The parallelization of the second case, corresponding to the loops in lines 6-10, is slightly more complex because of the nested organization of the loops. Here, these loops cannot be tackled by using a collapse clause that internally transforms them into a single loop, because of the dependency of the counter of the innermost loop ($j$) on the index of the outermost one ($i$). Instead, only the outer loop is annotated with a parallel for directive, which exposes fewer iterations and, given the triangular iteration space, may incur some load imbalance.
In summary, this second approach provides an easy-to-program strategy to parallelize the BRL procedure (as well as the corresponding generalization to hierarchical matrices). However, the performance will depend on the number of iterations of the loops and may be constrained because, at execution, each loop parallelized in this manner implicitly requires a synchronization point for the threads upon completion.
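A sketch of this loop-parallel strategy for the dense BRL code introduced earlier (hypothetical routine name; compile with -fopenmp):

    #include <cblas.h>
    #include <lapacke.h>

    /* Loop-parallel BRL Cholesky: the two block loops carry OpenMP
     * parallel-for directives.  The nested update loops cannot be
     * collapsed (the bound of j depends on i), so only the outer loop
     * is parallelized; dynamic scheduling mitigates the imbalance of
     * the triangular iteration space.  An implicit barrier follows
     * each parallel loop.                                            */
    void chol_brl_loops(int n, int b, double *A)
    {
        for (int k = 0; k < n; k += b) {
            double *Akk = A + (size_t)k * n + k;
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, Akk, n);

            #pragma omp parallel for       /* independent solves      */
            for (int i = k + b; i < n; i += b)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit, b, b, 1.0,
                            Akk, n, A + (size_t)k * n + i, n);

            #pragma omp parallel for schedule(dynamic)
            for (int i = k + b; i < n; i += b)
                for (int j = k + b; j <= i; j += b)
                    /* For brevity, GEMM also covers the i == j case;
                     * it then (harmlessly) fills the unreferenced
                     * upper triangle of the diagonal block.          */
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                b, b, b, -1.0,
                                A + (size_t)k * n + i, n,
                                A + (size_t)k * n + j, n,
                                1.0, A + (size_t)j * n + i, n);
        }
    }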
The third parallelization approach is the most sophisticated strategy, but also the option that can be expected to render the highest performance. The idea is to rely on the "sequential" BRL procedure in order to identify the real data dependencies between the operations of the algorithm. Each of these operations then becomes a task/node, and the dependencies act as edges connecting pairs of nodes of a task dependency graph (TDG). This graph defines a partial order that admits multiple correct orderings of the operations. Crucially, the TDG depends only on the actual dependencies of the mathematical operation, not on the algorithm itself: when applied to any of the three known variants for the computation of the Cholesky factorization, the result is the same TDG. In other words, executing the operations comprised by the Cholesky factorization in any order that fulfills the dependencies registered in the TDG will produce the same correct solution as a sequential run (except for small variations due to the use of finite-precision arithmetic and rounding error).
This third approach aims to maximize the degree of concurrency exposed to the hardware, which is offered multiple (correct) alternatives in the sequencing of the operations involved in the factorization. In principle, exploiting this type of parallelism may seem complex, because it requires encoding the algorithm, discovering the data dependencies between operations (preferably at execution time), and keeping track of the order in which they are issued for execution. Fortunately, this burden can be greatly alleviated by employing a proper parallel programming model, assisted by a runtime with support for task-parallelism. This is the case of OmpSs, which was originally conceived to exploit this type of parallelism, and also of OpenMP, which supports tasks since release 3.0 and task dependencies since release 4.0. The mechanism to automatically detect dependencies, the implications from the point of view of decomposing tasks into finer-grain sub-operations, and the consequences of the data storage layout are all discussed in [2], which focused on the implementation details.
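As a minimal sketch (again for the dense BRL code, with hypothetical routine names), the same computation can be expressed with OpenMP 4.0 task dependencies; OmpSs admits an analogous formulation with its in/out/inout clauses. The first entry of each block acts as the dependency tag from which the runtime derives the TDG:

    #include <cblas.h>
    #include <lapacke.h>

    #define BLK(i, j) (A + (size_t)(j) * n + (i))   /* block (i,j)   */

    /* Task-parallel BRL Cholesky: each kernel becomes a task, and the
     * runtime may execute the tasks in any order that respects the
     * dependencies encoded in the depend clauses.                    */
    void chol_brl_tasks(int n, int b, double *A)
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < n; k += b) {
            #pragma omp task depend(inout: BLK(k, k)[0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, BLK(k, k), n);

            for (int i = k + b; i < n; i += b) {
                #pragma omp task depend(in: BLK(k, k)[0]) \
                                 depend(inout: BLK(i, k)[0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit, b, b, 1.0,
                            BLK(k, k), n, BLK(i, k), n);
            }

            for (int i = k + b; i < n; i += b)
                for (int j = k + b; j <= i; j += b) {
                    if (i == j) {
                        #pragma omp task depend(in: BLK(i, k)[0]) \
                                         depend(inout: BLK(i, i)[0])
                        cblas_dsyrk(CblasColMajor, CblasLower,
                                    CblasNoTrans, b, b, -1.0,
                                    BLK(i, k), n, 1.0, BLK(i, i), n);
                    } else {
                        #pragma omp task depend(in: BLK(i, k)[0], BLK(j, k)[0]) \
                                         depend(inout: BLK(i, j)[0])
                        cblas_dgemm(CblasColMajor, CblasNoTrans,
                                    CblasTrans, b, b, b, -1.0,
                                    BLK(i, k), n, BLK(j, k), n,
                                    1.0, BLK(i, j), n);
                    }
                }
        }   /* implicit barrier at the end of the parallel region    */
    }

Note that, in contrast with the loop-parallel version, no synchronization is required between iterations of the $k$ loop beyond the dependencies themselves, so tasks from different iterations may overlap.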
All the experiments in this section were performed using IEEE double-precision arithmetic on a server equipped with two Intel Xeon E5-2603 v3 processors (6 cores each, at 1.6 GHz) and 32 Gbytes of DDR3 RAM. Our codes were linked with Intel MKL (composer_xe_2011_sp1) for the BLAS kernels and the Cholesky factorization, and with OmpSs (version 16.06).
As argued at the beginning of this paper, and following previous works (see [2]), our goal is to expose the performance benefits of leveraging task-parallel programming models such as OmpSs and OpenMP for the solution of linear algebra operations on $\mathcal{H}$-matrices.
The $\mathcal{H}$-matrix configurations employed in the evaluation are summarized in the following table; they vary the problem size and the block granularity exposed at each level of the hierarchy.
Configurations for the experimental evaluation of the $\mathcal{H}$-Cholesky factorization.

    Problem size (n)   Number of levels   Block granularity in each level
    5,000              2                  5,000, 100
    10,000             2                  10,000, 500
Figures 5 and 6 report the GFLOPS (billions of floating point operations per second) rates attained by the different parallel approaches for each of the matrix configurations on the Intel server using 4, 8 and 12 threads/cores. These performance rates take into account that some of the blocks may be null when calculating the actual flops necessary to factorize each matrix. A higher GFLOPS rate thus indicates higher performance/shorter execution time. The parallel variants evaluated there include:
MKL Multithread: Parallel algorithm that exploits parallelism inside the BLAS kernels invoked during the execution of the BRL variant of the hierarchical Cholesky factorization; see subsection 4.1.
OpenMP - Simple: Loop-parallel algorithm that exploits the loop-parallelism present in the BRL variant of the hierarchical Cholesky factorization; see subsection 4.2.
OmpSs and OpenMP - Tasks: Task-parallel algorithms that exploit the parallelism defined by the TDG associated with the hierarchical Cholesky factorization; see subsection 4.3.
The results of this evaluation offer some general conclusions:
The GFLOPS rates observed with the four parallel variants grow with the number of cores for all tested configurations (problem size, number of levels of the hierarchical matrix, and sparsity), but there are notable differences depending on the specific approach.
In general, the task-parallel versions (OmpSs and OpenMP - Tasks) outperform the OpenMP - Simple and MKL Multithread variants. Furthermore, when the number of tasks is small, the OpenMP task-parallel implementation offers higher parallel performance than the task-parallel version that relies on OmpSs. For larger matrices, when the number of tasks grows, the differences between the two task-parallel solutions are reduced.
In most cases the parallel performance is slightly reduced as the amount of null blocks increases, in comparison with denser configurations.
We have developed and evaluated several parallel algorithms for the solution of hierarchical SPD linear systems, via the $\mathcal{H}$-Cholesky factorization, on multicore architectures, relying on a multi-threaded BLAS, on loop-parallelism via OpenMP, and on task-parallelism via OmpSs and the tasking features of OpenMP.
As part of future work, we intend to analyze the possibility of extracting concurrency from a combination of two or more of these complementary mechanisms, and to shift our parallelization effort to a mature library for $\mathcal{H}$-matrices.