In this work, the scalability of the Algebraic Multiscale Solver (AMS) (Wang et al. 2014) for the pressure equation arising from incompressible flow in heterogeneous porous media is investigated on massively parallel GPU architectures. The solver's robustness and scalability are compared against its carefully optimized implementation on the shared-memory multi-core architecture (Manea et al. 2016), which this work directly extends. Although several components of the AMS algorithm are directly parallelizable, its scalability on GPUs depends heavily on the underlying algorithmic details and data-structure design of each step, where one needs to ensure favorable control- and data-flow on the GPU while extracting enough parallel work for a massively parallel environment. In addition, the type of algorithm chosen for each step greatly influences the overall robustness of the solver. Taking all these constraints into account, we have developed a GPU-based AMS that exploits the parallelism in every module of the solver, in both the setup phase and the solution phase. The performance of AMS, with our carefully optimized algorithmic choices on the GPU for both phases, is demonstrated using highly heterogeneous 3D problems derived from the SPE10 benchmark (Christie et al. 2001), ranging in size from millions to tens of millions of cells. The GPU implementation is benchmarked on a massively parallel architecture consisting of NVIDIA Kepler K80 GPUs, and its performance is compared to an optimized multi-core AMS implementation (Manea et al. 2016) running on a shared-memory multi-core architecture with two packages of Intel's Haswell-EP Xeon(R) CPU E5-2667.
While the GPU-based AMS parallel implementation shows good scalability for the solution stage, its setup stage is less efficient than on the CPU, mainly due to its dependence on a QR-based basis-function solver.
Reverse Time Migration (RTM) and Full Waveform Inversion (FWI) are among the most critical and computationally intensive algorithms in the seismic processing workflow. They involve temporal cross-correlation of forward and adjoint states at the same time steps and therefore require saving the forward states in memory. Checkpointing is implemented to trade memory usage for data movement and recomputation. The increased data movement is especially detrimental to the performance of Graphics Processing Units (GPUs), where data transfers are much slower than compute. Moreover, limited GPU memory necessitates more frequent transfers, and effective GPU utilization is lowered because the GPU waits for data copies to finish before resuming computation. This lowers effective performance when solving the adjoint problem and delays the time-to-solution of RTM/FWI workflows. We propose a two-level checkpointing formulation for GPUs using asynchronous compute and Non-Volatile Memory Express (NVMe) storage, which hides all data-movement overhead and enables continuous GPU usage without waiting for data transfers. The parameters of the checkpointing formulation generalize to multiple systems and to any RTM/FWI formulation using bandwidth and throughput values. Implementing optimized data-transfer approaches leads to faster compute time with increased GPU utilization. We demonstrate our results using an acoustic RTM formulation.
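To illustrate the kind of bandwidth-and-throughput reasoning the abstract describes, here is a minimal sketch, entirely our own and not the authors' formulation, of sizing a checkpoint interval so that one asynchronous snapshot transfer is fully hidden behind forward compute. All parameter values are illustrative assumptions.

```python
import math

def steps_to_hide_transfer(snapshot_bytes, bandwidth_bytes_s, step_time_s):
    """Smallest number of forward time steps whose compute time covers
    one asynchronous snapshot transfer over the given link."""
    transfer_time = snapshot_bytes / bandwidth_bytes_s
    return max(1, math.ceil(transfer_time / step_time_s))

# Example (assumed numbers): a 4 GB wavefield snapshot over a 12 GB/s
# link, with 50 ms of compute per forward time step.
interval = steps_to_hide_transfer(4e9, 12e9, 0.05)
```

If checkpoints are written no more often than every `interval` steps, the transfer can overlap compute without stalling the GPU; a second, slower tier (e.g. NVMe) would be sized the same way with its own bandwidth value.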
Presentation Date: Wednesday, October 17, 2018
Start Time: 1:50:00 PM
Location: Poster Station 18
Presentation Type: Poster
In the current study, a simulation of the interaction between a three-dimensional dam-break wave and a vertical square column is carried out using the MPSGPU-SJTU solver. The simulation conditions follow the experiments performed by Yeh and Petroff (2006). The results of the GPU solver are compared with those of other studies. The evolution of the three-dimensional dam-break wave, including the climb, fragmentation, and rollover of the free surface, is presented in this paper. During the interaction between the dam-break wave and the vertical square column, the net force exerted on the column is monitored and found to be in good agreement with existing experimental data. A remarkable speedup is obtained by comparing the calculation time of the GPU solver with that of the CPU version. The effect of a bottom water layer is also investigated; the results show a significant difference between the flow phenomena with and without the water layer.
The impact of waves on structures is an important problem in ship and ocean engineering, involving nonlinear wave-surface evolution, wave climbing and slamming on structures, and severe deformation or even fragmentation of the free surface around structures. In recent years, the mesh-free MPS method has gained popularity for modeling free-surface flows and has become an alternative to traditional mesh-based methods for modeling waves. Owing to the Lagrangian nature of the mesh-free method, no special treatment of the free surface is needed when simulating nonlinear free-surface flows, especially when surface tension is unimportant. This property makes it particularly attractive for modeling water waves, e.g., dam break (Zhang et al., 2011), sloshing (Yang et al., 2015), and water entry (Chen et al., 2017).
Early applications of the MPS method were limited to two-dimensional flow problems because of the method's high computational cost: three-dimensional problems require a very large number of particles. To improve the efficiency of the MPS method, researchers have pursued two main ideas. One is local particle refinement, which uses fewer particles to obtain comparable accuracy, such as the multi-resolution particle method (Tang et al., 2016) and the overlapping particle method (Shibata et al., 2012). The other is parallelization, which can be divided by hardware environment into CPU-based parallel methods (Ikari and Gotoh, 2008; Iribe et al., 2010) and GPU-based methods. Zhu et al. (2011) developed different versions of MPS code based on different GPU memories. Hori et al. (2011) used the CUDA (Compute Unified Device Architecture) language to develop a GPU-accelerated MPS code and achieved a speedup of only about 3-7x when simulating a two-dimensional (2-D) dam break. Li et al. (2015) applied GPU acceleration to two parts of MPS: the neighbor particle list and the pressure Poisson equation. In simulations of 3-D dam break and sloshing, the speedups of these two parts were about 1.5 and 10, respectively. Gou et al. (2016) used GPU-accelerated MPS to simulate the isothermal multi-phase fuel-coolant interaction.
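The "neighbor particle list" mentioned above is typically built with a cell-linked list, so that interaction candidates come only from adjacent cells instead of all N particles. The following is an illustrative serial sketch of that idea (our own, not code from the cited papers); on a GPU, one thread would perform the per-particle search.

```python
from collections import defaultdict

def build_cells(positions, h):
    """Bin particles into cubic cells of edge length h (the
    interaction radius), keyed by integer cell coordinates."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        cells[(int(x // h), int(y // h), int(z // h))].append(i)
    return cells

def neighbors(i, positions, cells, h):
    """Particles within distance h of particle i, found by scanning
    only the 27 cells around particle i's cell."""
    xi, yi, zi = positions[i]
    cx, cy, cz = int(xi // h), int(yi // h), int(zi // h)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    if j != i:
                        xj, yj, zj = positions[j]
                        r2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
                        if r2 < h * h:
                            result.append(j)
    return result
```

Because each particle's search is independent, this step parallelizes naturally, which is why it was one of the two MPS kernels targeted for GPU acceleration.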
For modern-day reservoir simulators, it is essential to provide a realistic physical description of reservoirs, fluids, and hydrocarbon extraction technology, and to guarantee excellent performance and parallel scalability.
In the past, advances in simulation performance were largely limited by the memory throughput of CPU-based computer systems. Recently, a new generation of graphics processing units (GPUs) has become available for general-purpose computing with support for the double-precision floating-point operations necessary for dynamic reservoir simulation. Graphics cards currently on the market have thousands of computational cores that can be efficiently utilized for simulations.
In this paper, we present, for the first time, results of running a full-physics reservoir simulator on a CPU+GPU platform and discuss the implications of this modern technology for existing reservoir simulation workflows. We discuss the challenges of, and the solutions developed for, running reservoir simulations on modern CPU+GPU hardware, and we propose a methodology for distributing the workload efficiently between the various parts. The approach is tested on several data sets and on various computational platforms, from personal computers to clusters, with and without GPUs involved.
The technology proposed in this paper demonstrates a multifold speedup for models with a substantial number of active grid blocks. The speedup due to GPU utilization can in some cases reach as high as 3-4 times compared to the traditional CPU-based approach. Considering the recent progress in GPU development, this factor is expected to grow in the near future, and the hybrid CPU+GPU approach makes it possible to exploit the exciting potential of this hardware evolution. The results, advances, and potential bottlenecks, combined with a detailed analysis of the performance and the ‘value for money’ of modern hardware solutions, are discussed.
Development of a natural gas field by depletion drive is characterized by a decrease over time in average formation pressure and in the bottom-hole and wellhead pressures of production wells. As a rule, during the initial field development period the available formation pressure is sufficient for gas transportation from wellheads to a treatment facility, and then to a tie-in to the trunk gas pipeline, without using compressor equipment. However, the formation pressure decreases gradually throughout the field's life, which in turn leads to a pressure decrease across the entire system "Reservoir - Well - Infield Gas Collection Networks - Gas Treatment Plant". There comes a point when the gas pressure at the gas treatment plant (GTP) inlet becomes insufficient for treatment and for further supply into the trunk gas pipeline at the required pressure and flow rate.
Thus, from the point of view of infrastructure development and production technology, the development of gas and gas condensate fields is divided into two periods: natural-pressure production and artificial-lift production. The difference between these two periods lies in the use of a compressor unit designed to raise the produced gas pressure to the values required both at the inlet of the treatment plant (in order to ensure the working parameters of the gas treatment process) and for further supply into the trunk gas pipeline. This unit is called a booster compressor station (BCS). As a rule, it is located at a site adjacent to the GTP site and serves two purposes:
Compression of gas for its downstream transportation;
Maintaining the required gas pressure at the GTP inlet.
Thus, BCSs are used to extend the period of stable gas production at gas and gas condensate fields where formation pressure has fallen to the point at which the pressures in the field gathering main pipeline, at the GTP, and in the trunk gas pipeline restrict well flow rates. In other words, a BCS makes it possible to maintain GTP capacity at the designed level and to increase gas recovery factors, since the lowered pressure at the BCS inlet can be used to reduce wellhead pressures and increase well flow rates.
Reservoir simulation plays an important role in the petroleum industry. Today, there is a specific demand to run ensembles of mega- and even giga-cell models. The iterative solution of the large systems of nonlinear governing equations, which describe multiphase mass transfer in the subsurface, takes most of the simulation time. The linearization part of the solution process occupies a significant fraction of that time, especially in compositional models. Moreover, the implementation of the linearization step usually embodies the most substantial, complex, and specific part of the computational loop in modern simulators, defining which physical mechanisms and assumptions are employed. This significantly complicates the implementation of simulation codes for heterogeneous computing hardware, which promises significant improvements in simulation time. In this work, we use the recently proposed Operator-Based Linearization (OBL) approach to develop a general-purpose reservoir simulation code aiming to substantially decrease simulation time. OBL offers a simplified linearization method that enhances the computational performance of simulation and makes porting to heterogeneous computing architectures straightforward. To distinguish the contributions of these two factors, we developed two versions of the compositional simulation prototype code: for traditional CPU and for GPU-accelerated hardware architectures. While the former allowed us to speed up the linearization stage by an order of magnitude in comparison with the conventional approach based on Automatic Differentiation (AD), the latter improved it further by another order of magnitude. The developed prototype realizes the potential of the OBL approach and the GPU computing architecture, demonstrating a significant improvement in general-purpose simulation performance.
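The core idea behind OBL is that the physics operators entering the governing equations are pre-tabulated over the state space, and both operator values and their derivatives are then obtained by piecewise multilinear interpolation instead of AD. The following is a minimal one-dimensional sketch of that idea under our own simplifying assumptions (uniform table, 1D state); it is an illustration, not the authors' implementation.

```python
def tabulate(op, lo, hi, n):
    """Pre-compute operator values on a uniform grid of n+1 nodes
    over [lo, hi]; returns (values, lo, step)."""
    step = (hi - lo) / n
    return [op(lo + k * step) for k in range(n + 1)], lo, step

def interpolate(values, lo, step, x):
    """Return (value, derivative) of the tabulated operator at state x
    by piecewise linear interpolation; the interval slope doubles as
    the Jacobian entry, so no automatic differentiation is needed."""
    k = min(int((x - lo) / step), len(values) - 2)
    slope = (values[k + 1] - values[k]) / step
    return values[k] + slope * (x - lo - k * step), slope

# Example: tabulate a toy operator op(p) = p^2 once, then evaluate
# value and derivative cheaply during Newton iterations.
vals, lo, step = tabulate(lambda p: p * p, 0.0, 1.0, 10)
value, deriv = interpolate(vals, lo, step, 0.25)
```

Because each interpolation is a handful of arithmetic operations on a shared table, this evaluation maps naturally onto GPU threads, which is what enables the additional order-of-magnitude speedup reported above.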
In this work, we have designed and implemented a massively parallel version of the Semicoarsening Multigrid Solver.
The design of the algorithm uses a combination of plane relaxation and semicoarsening to efficiently handle anisotropies in 3D.
The two versions of the solver were tested using various highly heterogeneous multi-million-cell problems derived from the SPE10 Second Dataset Benchmark. For sufficiently large problems, the GPU implementation, running on Kepler-based K40c cards, is found to be consistently faster than the multi-core implementation running on 12 Intel® Xeon® E5-2620 v2 2.10 GHz cores. In addition, the inherently serial nature of multiplicative multigrid, along with the approach taken to minimize communication through PCIe, was found to limit scalability beyond 3-4 cores/GPUs.
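To make the semicoarsening idea concrete, here is a small illustrative sketch (our own, not the paper's code) of how such a hierarchy can be built: the grid is coarsened in only one direction (here z), so every level retains full x-y planes, which plane relaxation then smooths as 2D subproblems.

```python
def semicoarsen_levels(nx, ny, nz):
    """List of (nx, ny, nz) grid sizes, halving only nz until it
    reaches 1; x and y resolution is preserved on every level."""
    levels = [(nx, ny, nz)]
    while nz > 1:
        nz = (nz + 1) // 2  # round up so odd sizes still coarsen
        levels.append((nx, ny, nz))
    return levels

# Example with the SPE10 grid dimensions (60 x 220 x 85) used for
# illustration only.
levels = semicoarsen_levels(60, 220, 85)
```

Keeping planes intact is what makes the method robust to strong anisotropy, at the cost of a deeper, more serial V-cycle, consistent with the scalability limit noted above.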
C-CORE is engaged in understanding the iceberg and sea-ice design-load needs of the energy sector. As the energy industry ventures into oceans with greater ice cover and more icebergs, there is a significant need for efficient engineering tools to plan and manage operations in exploration, production, and safety. Industry requires a range of scenarios for its risk assessments, whereas existing simulations can be computationally intensive and time-consuming.
C-CORE has recently adopted the General-Purpose Computing on Graphics Processing Units (GPGPU) approach, which has significantly sped up several numerical ice-engineering applications related to icebergs and sea ice. The investigated model types are Monte Carlo approaches for probabilistic design and quadratic discriminant analysis. GPU computing with the Compute Unified Device Architecture (CUDA) is a new approach to solving complex problems, transforming the GPU into a massively parallel processor.
The present study applies GPGPU technology to a Monte Carlo simulation used for a sea-ice load application. The objective is to measure the performance of the GPU using CUDA and compare it against serial Central Processing Unit (CPU) implementations in C++ and MATLAB. Results show a speedup of up to 2,600 times for the GPGPU implementation compared to the MATLAB implementation, reducing the elapsed time from about 1.5 hours to about 2 seconds. This strongly indicates that the GPGPU approach can help the industry significantly reduce the time required for such simulations.
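Monte Carlo load assessments of this kind are embarrassingly parallel: every sample is an independent draw, which is why they map so well onto GPU threads. The following toy sketch is our own illustration, not C-CORE's model; the lognormal load distribution, its parameters, and the threshold are assumptions chosen only to show the sampling structure that each GPU thread would execute.

```python
import random

def exceedance_probability(threshold, n, seed=0):
    """Estimate P(load > threshold) from n independent samples of a
    toy lognormal ice-load model (parameters are illustrative)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        load = rng.lognormvariate(mu=1.0, sigma=0.5)  # one sampled load
        if load > threshold:
            hits += 1
    return hits / n

# Example: probability that the toy load exceeds an assumed threshold.
p = exceedance_probability(5.0, 20000)
```

In a CUDA implementation, the loop body becomes the per-thread kernel and the hit counts are combined with a parallel reduction, which is where the reported orders-of-magnitude speedup comes from.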
Over the past 30 years, the development of computational geophysics in China can be divided into four periods. From 1986 to 1991, the period of theoretical study and numerical experiment, work concentrated on the theory of tomography, inversion, and pseudo-differential operators; mainframes and personal computers (PCs) were widely used. From 1992 to 1997, the period of three-dimensional (3D) imaging with integration and two-dimensional imaging with wave equations, the main topics were prestack migration, joint inversion of logging and seismic data, and multi-wave data processing; high-performance PCs and MIPS and SUN workstations were used for computation. From 1998 to 2007, the period of 3D imaging with wave equations, the focus was on 3D wave-equation prestack depth migration and 3D integration depth migration; PC clusters were the main computational tool. From 2008 to the present, it has been the era of 3D large-scale imaging, with research on 3D prestack time migration, 3D reverse-time migration (RTM), 3D anisotropic RTM, 3D full waveform inversion (FWI), 3D elastic RTM, and 3D elastic FWI, using GPUs and multi-core CPUs for computation.
Presentation Date: Monday, October 17, 2016
Start Time: 4:10:00 PM
Presentation Type: ORAL
Tsunamis generated by earthquakes generally propagate as long waves in the deep ocean and may be described mathematically by the shallow water equations (SWEs). Tsunami propagation and inundation usually involve a vast problem domain, which requires a highly efficient numerical model to provide accurate predictions. In this work, a hydrodynamic model that solves the 2D SWEs using a finite volume Godunov-type shock-capturing scheme is comprehensively tested on different hardware devices, covering both Central Processing Units (CPUs) and Graphics Processing Units (GPUs), for efficient tsunami modeling.
Tsunamis may cause huge loss of life and economic damage, as evidenced by the 2004 Indian Ocean event and the 2011 Japan event. Numerical prediction of tsunami propagation and inundation provides essential information for evacuation management, risk assessment, city planning, and structural design. Numerical models are also an indispensable component of most tsunami forecasting and warning systems.
Tsunami propagation and inundation can be mathematically represented by the shallow water equations (SWEs) or Boussinesq equations with an acceptable level of accuracy. Most of the prevailing tsunami models solve the SWEs or Boussinesq equations using finite difference methods (FDM) (Imamura, 1996; Titov and Synolakis, 1995), finite volume methods (FVM) (Leveque et al., 2011), finite element methods (FEM) (Tinti et al., 1996), or smoothed particle hydrodynamics (SPH) (Benedict and Robert, 2008). However, a tsunami event usually takes place over a vast domain, and assessment of tsunami impacts may need multi-scale simulations that can accurately predict wave propagation across the ocean as well as inundation in urban areas requiring high-resolution representation of topographic features. The high computational demand of such modeling exercises hinders the wider application of most existing tsunami models.
In order to improve the computational efficiency of tsunami models and facilitate multi-scale simulations, different approaches have been widely reported in the literature, including adaptive mesh refinement (e.g. Leveque et al., 2011; Popinet, 2011) and parallel computing (e.g. Lavrentiev-jr et al., 2009; Pophet et al., 2011). In recent years, attempts have also been made to explore the potential of graphics processing units (GPUs) for improving model performance. GPU-accelerated models have been presented in computational biophysics (Owens et al., 2008), computational fluid dynamics (Crespo et al., 2011), and computational hydraulics (Brodtkorb et al., 2012; Smith and Liang, 2013), among other fields. More recently, researchers have also attempted to develop CUDA-based GPU models (Vazhenin et al., 2013; Amouzgar et al., 2014) for tsunami simulations to further demonstrate the potential of this modern high-performance computing technology. However, model performance across different devices has not yet been adequately compared to fully justify the benefit of this new development.
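The per-cell structure of finite volume SWE schemes is what makes them attractive for GPUs: each cell update depends only on its neighbors. The following is a simplified sketch of our own, not the paper's scheme: one first-order finite volume update for the 1D SWEs on a flat bed with periodic boundaries, using a Lax-Friedrichs flux in place of a Godunov-type shock-capturing solver.

```python
G = 9.81  # gravitational acceleration (m/s^2)

def flux(h, hu):
    """Physical flux of the 1D SWEs for depth h and discharge hu."""
    u = hu / h
    return hu, hu * u + 0.5 * G * h * h

def step(h, hu, dx, dt):
    """One conservative first-order update; each cell i is independent
    given its neighbors, so on a GPU one thread would handle one cell."""
    n = len(h)
    new_h, new_hu = h[:], hu[:]
    a = dx / dt  # Lax-Friedrichs dissipation speed
    for i in range(n):
        l, r = (i - 1) % n, (i + 1) % n  # periodic boundaries
        fl_h, fl_hu = flux(h[l], hu[l])
        fm_h, fm_hu = flux(h[i], hu[i])
        fr_h, fr_hu = flux(h[r], hu[r])
        # left/right interface fluxes (Lax-Friedrichs averaging)
        Fl_h = 0.5 * (fl_h + fm_h) - 0.5 * a * (h[i] - h[l])
        Fl_hu = 0.5 * (fl_hu + fm_hu) - 0.5 * a * (hu[i] - hu[l])
        Fr_h = 0.5 * (fm_h + fr_h) - 0.5 * a * (h[r] - h[i])
        Fr_hu = 0.5 * (fm_hu + fr_hu) - 0.5 * a * (hu[r] - hu[i])
        new_h[i] = h[i] - dt / dx * (Fr_h - Fl_h)
        new_hu[i] = hu[i] - dt / dx * (Fr_hu - Fl_hu)
    return new_h, new_hu
```

Because the scheme is written in conservation form, mass and momentum are conserved to rounding error, a property any production tsunami solver must preserve when the loop is mapped onto GPU threads.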