Basics on parallel computing

The development of workstations and High Performance Computing (HPC) systems benefited from Moore's law for a long period, during which the computing power of microprocessors doubled roughly every two years thanks to the increasing density of transistors embedded in integrated circuits. This period of "unicore free lunch" ended around 2005, and hardware development entered the post-Moore era. This new period is still dominated by increasing requirements on both performance and energy efficiency. In response, hardware advances of the last decade have led to increasingly heterogeneous architectures, featuring growing numbers of nodes and cores and a growing reliance on specialized computing components such as GPUs and other accelerators. This trend is still continuing, leading to ever more complex hardware.

A current state-of-the-art HPC system consists of a cluster of computing nodes connected by a high-throughput, low-latency network relying, e.g., on InfiniBand technology or an equivalent. Each computing node usually hosts several processors, each of which embeds several computing cores. Finally, recent processor cores are able to run the same instruction on multiple data simultaneously, which is known as vectorization. As a result, this type of hardware enables several levels of parallelism: at the core level (vectorization), at the node level (between cores) and, finally, at the cluster level (between nodes). The higher the level, the lower the speed of data exchange. This multiscale feature is also noticeable regarding memory, which is increasingly hierarchical: several levels of fast cache units sit between the cores and the main SDRAM memory unit, and intermediate storage components such as NVRAM burst buffers are emerging between SDRAM and the storage areas. These last components are generally of two types: local to the computing node for scratch data, or distributed across the network. Thus, in order to exploit the tremendous parallel capabilities of these new architectures, special care must be taken regarding task independence and data locality.

There are basically two reasons why modern Finite Element software should efficiently handle parallelism on this type of parallel architecture.

  • The first reason is, obviously, to accelerate the resolution of simulations on both single workstations and computing clusters by taking advantage of all the available computing units. Here, the efficiency of parallelism is observed in terms of strong scalability: in the ideal case, the total computing time is divided by the number of independent computing units used.

  • The second reason is to spread memory usage over several computing nodes. As a matter of fact, the demand for increasingly fine-scale simulations leads to numerical problems whose size usually exceeds the SDRAM memory available on a single computing node. This happens, e.g., in material science when studying the influence of morphological details on the global behavior of a component. Here, the efficiency of parallelism is mainly observed in terms of weak scalability: in the ideal case, the total computing time remains constant when computing resources are added in proportion to the increase of the problem size. Both notions are formalized just after this list.
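As a minimal formalization (the notation T(p, n) is introduced here only for illustration and denotes the wall-clock time needed to solve a problem of size n on p computing units), the two notions can be written as

    S_{\mathrm{strong}}(p) = \frac{T(1,\, n)}{T(p,\, n)} \quad (\text{ideal: } p),
    \qquad
    E_{\mathrm{weak}}(p) = \frac{T(1,\, n)}{T(p,\, p\, n)} \quad (\text{ideal: } 1).

In practice, communication overhead and load imbalance keep both quantities below their ideal values.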

In view of the foregoing, there are basically three strategies for parallelization.

  • The first strategy is known as data parallelism, vectorization or SIMD (Single Instruction Multiple Data). It takes advantage of the ability of modern processors to perform the same operation simultaneously on multiple data, thanks to special instruction sets (AVX-512 being the most recent one on x86 processors). Generated at compilation time, this type of parallelism is transparent for the user but strongly depends on the capabilities of the target processor. Currently, Z-set makes little use of this approach, except through special-purpose libraries (such as the Eigen linear algebra library used in the new implementation of the domain decomposition solvers, see http://eigen.tuxfamily.org). A first sketch after this list illustrates the idea.

  • The second strategy is known as shared memory parallelism or multithreading. Here, specific tasks sharing the same data within a given process (that is, one running instance of the code) are parallelized at the core level using lightweight processes, a.k.a. threads. This type of parallelism does not require an initial splitting of the data, though a correct ordering is important in order to reach optimal performance. In Z-set, this feature is available during the material integration phase and in some linear solvers such as MUMPS, Dissection and the domain decomposition solvers. See the dedicated page for more details on how to run such a computation. A second sketch after this list illustrates the idea.

  • The third strategy is known as distributed memory parallelism. Here, the computation is distributed over several computing nodes. Multiple instances of the code are launched in parallel and inter-communicate, usually through the MPI (Message Passing Interface) protocol. This type of parallelism requires an initial partitioning of the data in order to localize it on each node prior to running the computation. In Z-set, this feature is available in conjunction with the domain decomposition solvers or with the distributed interface of the MUMPS sparse direct solver. See the dedicated page for more details on how to run such a computation. A third sketch after this list illustrates the idea.
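As an illustration of the first strategy, the following sketch (illustrative code only, not part of Z-set) shows a loop whose iterations are independent, so that an optimizing compiler can map several of them onto one SIMD instruction automatically, e.g. when built with g++ -O3 -march=native:

    // Illustrative only: an axpy-like kernel whose iterations are
    // independent, so the compiler may vectorize them using
    // SSE/AVX/AVX-512 instructions on x86.
    #include <cstddef>

    void axpy(double a, const double* x, double* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];   // same operation applied to multiple data
    }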
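For the second strategy, the sketch below (again illustrative, not Z-set code) distributes an element-wise loop over the cores of a node using standard C++ threads. All threads share the same output array and, since the chunks do not overlap, no synchronization is needed:

    // Illustrative only: an element-wise loop split between threads
    // that share the same data within one process.
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void work_on_range(std::vector<double>& out, std::size_t begin, std::size_t end) {
        for (std::size_t e = begin; e < end; ++e)
            out[e] = 2.0 * e;   // stand-in for per-element work
    }

    int main() {
        const std::size_t n_elem = 1000000;
        std::vector<double> out(n_elem);
        const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());

        std::vector<std::thread> pool;
        // Each thread works on a contiguous, non-overlapping chunk.
        for (unsigned t = 0; t < n_threads; ++t) {
            const std::size_t begin = n_elem * t / n_threads;
            const std::size_t end   = n_elem * (t + 1) / n_threads;
            pool.emplace_back(work_on_range, std::ref(out), begin, end);
        }
        for (auto& th : pool)
            th.join();
    }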
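Finally, for the third strategy, the following sketch (illustrative, not Z-set code) shows the basic communication pattern behind domain decomposition: each MPI process owns one subdomain of a one-dimensional chain and exchanges interface values with its neighbors:

    // Illustrative only: halo exchange between neighboring subdomains.
    // Build with mpicxx and run with mpirun -np <processes>.
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // 1D chain of subdomains: MPI_PROC_NULL at the ends is a no-op.
        const int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        const int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        double send_left = 1.0 * rank, send_right = 1.0 * rank;
        double recv_left = 0.0, recv_right = 0.0;

        // Exchange one interface value with each neighbor.
        MPI_Sendrecv(&send_left, 1, MPI_DOUBLE, left, 0,
                     &recv_right, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&send_right, 1, MPI_DOUBLE, right, 1,
                     &recv_left, 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
    }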

Today, achieving massive parallelism on recent multicore architectures requires mixing these three strategies in order to obtain multilevel parallelism. A typical parallel computation will thus involve several instances (or processes) of the code running in distributed mode through an MPI communicator, each acting on its own subset of the data. Each instance will then use the cores available on its compute node for shared memory parallelism through threads. Finally, each thread may use the vectorization capabilities of its core, depending on the target processor.
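The sketch below (illustrative, not Z-set code; OpenMP is used here for brevity in place of explicit threads) combines the three levels in a distributed dot product, e.g. built with mpicxx -O3 -fopenmp -march=native:

    // Illustrative only: MPI between processes, OpenMP threads within
    // each process, and a SIMD-friendly inner loop on each core.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // Cluster level: this process owns only its subset of the data.
        std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
        double local = 0.0;

        // Node level: threads share x and y within this process.
        // Core level: the inner loop is a vectorizable pattern.
        #pragma omp parallel for reduction(+ : local)
        for (long i = 0; i < static_cast<long>(x.size()); ++i)
            local += x[i] * y[i];

        // Combine the per-process partial results across the cluster.
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
    }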

This guide presents the ways to achieve parallelism within Z-set. It is organized as follows. The remainder of this part is devoted to the description of the architecture of Z-set regarding distributed memory parallelism. The preprocessing part, which mainly consists in splitting the input mesh prior to running a distributed parallel computation, is presented here. The way of running shared and distributed memory parallel computations within Z-set, together with the documentation of the available solvers, is presented in the solver part. Some of the post-processors available in Z-set can be run in parallel; the post section shows how to perform post-processing in parallel.