A Massively Parallel Algebraic Multiscale Solver for Reservoir Simulation on the GPU Architecture

Manea, A. M. (Saudi Aramco) | Almani, T. (Saudi Aramco)

OnePetro 

In this work, the scalability of the Algebraic Multiscale Solver (AMS) (Wang et al. 2014) for the pressure equation arising from incompressible flow in heterogeneous porous media is investigated on the GPU massively parallel architecture. The solvers robustness and scalability is compared against its carefully optimized implementation on the shared-memory multi-core architecture (Manea et al. 2016), which this work is directly extending. Although several components in the AMS algorithm are directly parallelizable, its scalability on GPU's depends heavily on the underlying algorithmic details and data-structures design of each step, where one needs to ensure favorable control-and data-flow on the GPU, while extracting enough parallel work for a massively parallel environment. In addition, the type of the algorithm chosen for each step greatly influences the overall robustness of the solver. Taking all these constraints into account, we have developed a GPU-based AMS that exploits the parallelism in every module of the solver, including both the setup phase and the solution phase. The performance of AMS--with our carefully optimized algorithmic choices on the GPU for both the setup phase and the solution phase, is demonstrated using highly heterogeneous 3D problems derived from the SPE10 Benchmark (Christie et al. 2001). Those problems range in size from millions to tens of millions of cells. The GPU implementation is benchmarked on a massively parallel architecture consisting of NVIDIA Kepler K80 GPU's, where its performance is compared to the multi-core CPU architecture using an optimized multi-core AMS implementation (Manea et al. 2016) running on a shared memory multi-core architecture consisting of two packages of Intel's Haswell-EP Xeon(R) CPU E5-2667. While the GPU-based AMS parallel implementation shows good scalability for the solution stage, its setup stage is less efficient compared to the CPU, mainly due to the dependence on a QR-based basis functions solver.