GPU accelerated particle-based computational acoustics solving based on SPH

1. Abstract: Smoothed particle hydrodynamics (SPH) is regarded as a pure Lagrangian approach, which can solve fluid dynamics problems without the creation of mesh. In this paper, a paralleled SPH solver is developed to solve particle-based computational acoustics (PCA). The aim of this paper is to study the feasibility of using SPH to solve acoustic problems and to improve the efficiency of solving processes by paralleling some procedures on GPU during calculating. A stand SPH code running serially in a CPU is proposed to solve wave equation. This is a wave propagating in a two-dimensional domain. After finishing the computation, the results are compared with the theoretical solutions and they agree well. So its feasibility is verified. There are two main methods for searching neighbor particles: all-pair search method and linked-list search method. Both methods are used in different codes to simulate an identical problem and their runtimes are compared to investigate their searching efficiencies. The runtime results show that linked-list search method has a higher efficiency, which can save a lot of searching time when simulating problems with huge amounts of particles. Furthermore, the percentages of different procedures’ runtimes in a simulation are also discussed to find the most consuming one. Then, some codes are modified to run in different GPUs and their runtimes are compared with those of serial ones on a CPU. Runtime results show that the paralleled algorithm can be more than 80 times faster than the serial one. The result shows that GPU paralleled SPH computing can achieve desirable accuracy and speed in solving acoustic problems.


Introduction
SPH [1] is a Lagrangian meshfree particle method for solving the fluid dynamics equations [2,3]. It is originally used for astronomic problems and has received remarkable success.
Since it is independent with mesh, it is suitable for solving large-enough system for longenough times [13]. In recent years, it has been employed to solve acoustic problems [14][15][16]. In order to obtain a relatively high precision, however, huge amounts of particles have to be adopted.
Solving these equations in a central processing unit (CPU) is time-consuming. Such requirements have limited its use in high performance computers [17].
To overcome these limitations, various accelerating methods have been employed. The traditional method has been adopted High Performance Computing (HPC) which involves the use of huge amounts of CPU [18]. But this is a less than desirable solution because the advance of computer industry can't keep up with the requirement of computing. The second type of acceleration approach is the use of Field-Programmable Gate Arrays (FPGAs). But the main use of this approach is in the field of astrophysics where smoothing lengths are variable. Hence its application is also limited.
The third, and perhaps the most promising one, is GPGPU technique [19]. GPGPU technique can take full advantages of the hardware of a computer, which uses GPU as a co-processor with GPU [20]. A GPU (Graphics Processing Unit) normally has hundreds of or even thousands of threads and can thus proceed huge amounts of calculation at a same time [21]. Due to GPU's lower energy costs, maintenance and price, GPGPU has now established themselves as an attractive alternative to massively parallel clusters [22]. Since the emerge of GPGPU technic, various dedicated programming frameworks have been created, such as CUDA (Compute Unified Device Architecture) and OpenCL(Open Computing Language) [23].
In 2006, NVIDIA provided its own development of CUDA, which is proven to be a successful framework in scientific computing. CUDA, a C/C++ extension, allows developers to easily harness the power of GPU in scientific computing [24]. Since developers don't need to understand the hardware, they can learn this new language very quickly. CUDA allocates the load of calculating into three-tier architecture: grid, block and thread. A grid consists of a lot of blocks, which can be one, two or three dimensional. While a block consists of numbers of threads, which can also have one, two or three dimensions. Thread is where kernel functions are processed. Kernel is a combination of a couple of algorithms, which is the minimized part that running in a GPU. Out of force of habit CPU and GPU may be called host and device respectively in the following part of this paper.
The parallel architecture of GPU enables itself perfectly suitable for SPH, a Lagrangian meshfree methods. SPH's computation domain can be divided into small parts and they are independent of each other. Given such condition, it is feasible that each thread of a GPU solves a small part of the sub-domain. So it is easy for GPU to simulate a large number of particles (10-100 million). In a literature [25], the use of the Nvidia CUDA programming paradigm has enabled SPH schemes to acquire a speedup of up to two orders of magnitude over a single thread in a CPU.
The structure of this paper is presented as follows: the basic theory of SPH is presented in Chapter 2. The structure of SPH code and the comparison of two different searching methods are presented in Chapter 3. CPUs and GPUs used in this paper are introduced in Chapter 4. The model and validation of the method are also presented Chapter 4. In Chapter 5, the results of GPU are analyzed and different factors including platform and particle number are discussed.

Basic concepts of SPH
In order to better understand the parallel method of SPH, a brief introduction of this approach is now presented. In SPH, a set of particles represent the fluid and their interaction is changing by time. As shown in Fig.1, Only a certain number of neighbors can influence a given particle and this is usually determined by distance. The dynamics are governed by a set of simultaneous ordinary differential equations in time.

Fig.1 Illustration of the smoothing function
Functions in the SPH method are represented in a particle approximation form.
For example, an arbitrary function f (r) can be written as where f is a function of the position vector r, W is the smoothing function (or the smoothing kernel ), Ω is the volume of the integral over all space, and h is the smoothing length. The chosen smoothing function has the following properties: its integral of the volume is a unit; it meets the Dirac delta function when h→0; its value is zero beyond the support domain; it decreases as the distance increase inside the support domain; and, last, it is symmetric. In this article, the Gaussian smoothing function is adopted.
The particle approximation for the function f (r) at particle i can be written as where r j is the position of particle j, N is the number of particles in the support domain, m j is the mass of particle j, and ρ j is the density. In the SPH convention, the kernel approximation operator is marked by the angle bracket <>.

SPH formulations of acoustic waves
In acoustic simulation, the governing equations for constructing SPH formulations are the laws of continuity, momentum, and state. There are two assumptions that are always used to simplify common acoustical problems. On the one hand, the medium is lossless and at rest, which means that an energy equation is unnecessary; On the other hand, a small departure from quiet conditions occurs.
They can be described as follows: where ρ is the fluid density, c 0 is the speed of sound, P is the pressure and u is the flow velocity.
The sound pressure δp, the particle velocity δv and the density change δρ are taken to be small quantities of first order.
By substituting these expressions into the acoustic equations and discarding second-order terms, the continuity, momentum and state equations (for ideal gas) can be written as where the subscript i and j stand for variables associated with particles i and j , and u ij = u iu j .
The momentum equation in SPH formulation can be obtained as Particle approximation of the equation of state for ideal gas is

Structure of the SPH code
This section describes how this code is paralleled in GPU and its data structures. As is shown in Fig.2, the current GPU-accelerated model of PCA may be divided into seven different parts which are related to input, initialization, searching neighbor particles (SNP), particle interaction (PI), system update (SU), output and writing output data. Basically, the input is designed to read information about physical parameters, such as the size of domain and initial particle size before a simulation starts. After reading input data, a kernel is launched to initialize the data, e.g. particles' location (coordinate x and y), and velocity (v and u). Note that these data are initialized on device instead of host, which can avoid the time-consuming process of transferring data between host and device. If a code uses linked-list search algorithm to search neighbor particles, some kernels are also needed to compute the cell that a particle belongs to and the particle number in a cell (as is shown in later of this section). Neighbor searching is an essential step of a SPH algorithm. Neighbor particles must be defined and stored in a list. After building the neighbor list, all particles in the support domain of particle i can be found and the pairwise interactions among adjacent particles can be calculated. After that, all physical magnitudes describing the state of the particles should be advanced according to the interaction of particle i. When it is necessary to output the results (the calculating time or timestep is due), data are copied back to a host and written onto files in a hard drive. Initialization, searching neighbor particles, particle interaction and system update are entirely executed on the device during a simulation, which avoids communication time between host and device. The other three steps are executed on the host.
Among these steps, searching neighbor particles and particle interaction may be the most time-consuming procedures in a simulation. Hence, concentrating on these steps would be advisable.
There are two algorithms been implemented: all-pair search and linked-list search.

All-pair search
With all-pair search algorithm, one thread corresponding to a specific particle searches all particles within the support domain to find its adjacent particles. Consider one computation domain consisting of N particles. To find particle its neighbors, a single thread need do N*(N-1)/2 calculations. If the particle number is small, little would do to the performance of GPU. However, when the number of particles increases gradually, the efficiency of GPU would reduce dramatically and the total computing time would multiply by square. For example, if it costs 1s for a thread to find neighbors in a 0.01 million particles domain, the time for a 1 million particles domain would be 10000s(nearly 3h). So the direct searching method is a time-consuming algorithm in searching neighbor particles.

Linked-list search
Linked-list search has been widely used for SPH simulation of incompressible flows. To use the method, all particles are firstly matched with a uniform grid with a width that equals to the maximum searching radius, normally 2 or 3 times of smoothing length. For an arbitrary particle i, all its effective neighbor particles locate in the adjacent 9 cells (consider a 2D domain) as is shown in Fig.3. So the total calculations would be about N. Surely, this method is a much more effective technic, though its data sorting strategy may be slightly complex.

Fig. 3 Example of neighbor particles when using uniform grid searching method
As is shown in Fig.4, when implementing linked-list search method, each particle is given a unique index according to its location in the domain and each cell is provided with an index.  Then, the cellBegin array is created to store the first particle location of each cell. After that, the cellParticleOrder array is created to list all particles according to the sequence of cell that a particle locates in. Note that these arrays are created in the initialization part on the device. During the simulation, each particle's location is likely to change due to the interaction between it and its neighbors every timestep. As a result, some arrays need to make a corresponding modification. All through the simulation, these three arrays are stored in the global memory of GPU, which means there is no need to make any communications between CPU and GPU. Since the costly communications can be avoided, the efficiency could be improved greatly.
During each timestep, four kernels are launched sequentially to finish the simulation. Table.1 summaries the functions of each kernel.   5 shows the pseudo code of Kernel A. For an arbitrary particle i, its cell is found according to particleIndex array. Then its adjacent cells are found in cellNeig. After that, all particles in the cell are compared to find its neighbor particles. Lastly, the interaction between particles is calculated.

Fig.5 Pseudo code of Kernel A
When launching Kernel C, each thread corresponds to one cell. While in other kernels, each thread corresponds to one particle. So, when the particle number is big enough, it is possible to achieve a high occupancy rate in block and register, which is a useful way to hide delay and improve efficiency.

Hardware
In order to assess the SPH's parallel performance, four different GPUs have been used in the simulation: Tesla K20C, Quadro M4000 and GeForce GT730. As a dedicated scientific computation device, Tesla can accelerate the speed of big data analysis effectively and have been used in professional workstations and HPC clusters. Quadro GPU is specified in image processing, which is suitable for the computation and plot of complex pictures. GeForce GPU is often used in personal computers and games. The major differences between the three GPUs are the clock rate, the number of processors and the available global memory. The former two factors have a huge influence on the speed of computation, whereas the available global memory partially determines the maximum particles that can be simulated in the GPU. As for the host, these three GPUs cooperate with three different CPUs: Intel Xeon E5-2680, Intel Xeon E5-2620 and Intel Xeon E5-2650, whose clock speed are 2.4GHz, 2.1GHz and 2.6GHz respectively. The performance attributes of these particular cards are summarized in Table.2.

Model
To investigate the accelerating performance of GPH for SPH, the propagation of a wave in a two-dimensional domain is simulated and the solving process is paralleled by GPU. The wave equation is presented as follow: Which can be written as: where t represents time (propagation starts when t= 0) and x is the geometric position. c 0 is the wave propagation speed which is 1 m/s. Ma is the Mach number which is 0.05 The acoustic model is shown in Fig. 6. The sound velocity at t = 0 is plotted with a solid line, while the sound velocity at t =10 s is plotted with a dashed line. Note that the velocity of particle doesn't change with Y coordinate if two particles have the same X coordinate. So various particles with same Y coordinate (y=0) are shown in Fig. 6.

Fig. 6 One-dimensional sound propagation model
The computation area is a 2D domain, whose x coordinate ranges from -10 to 40 and y coordinate ranges from -10 to 10. The bottom side (y=-10m) and the top side (y=10m) are set as the periodical condition which can provide a non-reflected condition for the wave's propagation. While at the left side (x=-10m) and right side (x=40m), the particles are set free in the computation. Since the SPH computational region is far longer than the sound propagation distance, any effects caused by ignoring this boundary will not propagate to the sound propagation domain, and therefore can be neglected. During each timestep, each particle's coordinate and velocity need to be calculated. To research the relationship between GPU's accelerating performance and particle number, the minimum particle number in the domain is 1K and the maximum is 62.5M, which can be altered by changing the initial space (∆x) between particles. In all cases, the timestep is set as dt =0.2*∆x and the smooth length takes h=2.13*∆x. The Courant-Friedriches-Lewy number is written as C CFL for short, and C CFL = u∆t / ∆x. The particle density ρ 0 is 1.0kg/m 3 .

Validation
A code running in CPU is first proposed to verify the feasibility and accuracy of the SPH Wave propagate 1 m/s method for solving wave equation problems. Note that all-pair search algorithm is used in this code.
The test cases run in the No.1 system. All the results presented in this as well as the following subsection are obtained via double precision. Fig.7 (a) and (b) shows the simulated results of velocity distribution at t=5s and t=10s, which clearly depicts the process of sound wave propagation. The initial space is 0.01m. The comparison between theoretical solution and simulated results along y=0 at t=5s is shown in Fig.7 (b). The theoretical solution is plotted with a solid line while the SPH results is plotted with red points. In order to clearly identify the numerical results, the points are plotted at intervals of 12.
Where N is the number of particles in the computation domain, V the is the theoretical solution of particle's velocity; and V sim is the simulation result.

All-pair search & Linked-list search
In theory, using all-pair search method for nearby particles searching is costly, whose calculation load is proportional to the square of particle number. Whereas the calculation load of the linked-list search method is proportional to particle number. A case's calculation time would increase dramatically if its particle number multiples. In order to assess the actual efficiency of the two algorithms, two codes based on the above two searching algorithms are programmed and they run in CPU. Then, these two programs were used to simulate different cases whose initial space ranges from 1m to 0.0625m and the particle number ranges from 1K to 0.256M. The efficiencies of the two search methods are analyzed by comparing their computation times. Since these two codes do not require GPU's co processing, their computation times are closely related to CPUs' performance. The comparisons are based on the results of the 10 th timestep. As is shown in Fig.9(a), the actual runtime increases with the number of particles dramatically.
For all-pair search method, when the particle numbers is 1K, the runtime is 0.15s; and when the particle numbers is 256K, the runtime is 10987s. While for linked-list search method, the runtimes are 0.07s and 25.9s respectively.
To examine the runtime difference between the algorithms, each average runtime was compared by calculating speedup for each simulation. The speedup is computed using: where lg1 t A and lg 2 t A are mean runtimes for algorithm 1 and algorithm 2, respectively. Although these two codes run in different CPUs, the speedups gained in different cases are almost the same. This comparison clearly shows the higher efficiency and advantage of using linked-list search method for nearby particles searching.
However, with the number of particles continues to grow, the runtime will increase dramatically even if linked-list search method is used for searching neighbor particles. Fig.10 shows the runtime varies with the number of particles which ranges from 1K to 8M. When the number of particles is small (1K-1M), the runtime is linear proportional to it approximately. But when the number of particles is relatively large (1M-8M), the relation is much similar to a quadratic curve. This is because particles' coordinate, pressure and speed are stored in the form of a dynamic array in RAM. When the array length is short, CPU can mask the changes of the reading and writing efficiency of an array. As the array length increases, the reading and writing speed is getting lower and lower. As a consequence, although the computation time of a single particle does not increase, the time of reading and writing increases, resulting in a dramatic increase in runtime. In summary, linked-list search method is effective in cases whose particle numbers are below 10M.

Percentage of time in different steps
In previous sector, it is clear that linked-list search method is much more efficient than all-pair search method. In the following sector, all codes use this method for searching neighbor particles, which can greatly reduce the runtime. However, SPH is still a time-consuming method. All data need to be initialized before simulating. Then there are three main steps in each timestep or a loop: searching neighbor particles (SNP), particle interaction (PI) and system update (SU). The runtimes of these steps add to the total time of a single timestep. As a consequence, it is advisable to find the most consuming step in a simulation and then paralleled the step in GPU to reduce the runtime of a simulation. Fig.11 gives the percentage of three tasks in a loop. The runtime is the average time of 100 steps. By varying the initial space, particle numbers of the domain are 0.025M, 0.1M, 0.4M and 2.5M respectively. All the cases run on system 1. Fig. 11 Percentage of time that each task uses in a timestep From Fig.11, one can see that the percentage of time that three tasks use in a loop are almost the same, no matter what the particle number is. Particle interaction is the most consuming procedure, which accounts to about 53% of total runtime. Searching neighbor particle also consumes a lot of time, which takes around 43% of runtime. As for system update, it takes a very small part of total runtime, which can be omitted to some extent.
Initialization is another important procedure that may influence the total runtime in a CPU code. Since there is no need to transfer data between host and device, the time can be saved. Fig.12 shows the percentage of time that initialization and a loop cost in a simulation consisting of only one timestep. It can be seen that the percentage of initialization increases with the number of particles in a computation domain. When particle number is 0.025M, the percentage is 7.1%; while when particle number increases to 2.5M, the percentage leaps to 88.2%. Although in simulations which requires many loops the percentage declines, it is still sensible to reduce the time consumed in initialization. So the runtime of a simulation that consists of only a few steps can be reduced greatly.
In conclusion, the times devoted to initialization, searching neighbor particles and particle interaction are huge. In GPU computation, these three steps should be paralleled to reduce total runtime and to promote efficiency.

GPU results
In previous sections, it has been proved that it is feasible to use SPH method to solve wave equations and it can achieve a relatively high precision. Moreover, in a certain range of particle number, using linked-list search method to search neighbor particles can achieve a very high efficiency, which can save a lot of runtime compared to all-pair search method. In this section, a  One can see that the runtime for input is invariable as the data inputted into device are constant.
After data are inputted into a device, then memories need to be allocate for a number of arrays, such as the array for particle position and velocity. Then all particles' position, velocity and pressure need to be initialized. These two procedures add to task two: initialization. The runtime of initialization is linear proportional to particle number approximately and it is much smaller than that of codes run on host. For example, when particle number is 0.025M, the runtimes for host and device are 37.3ms and 1.23ms respectively; when particle number is 10M, they are 457.7ms and 376.6s respectively.
So the paralleled code can greatly reduce the runtime of initialization. Data output takes a small part of the whole calculating time. When particle number is 10M, its proportion is only 2%. When particle number is relatively small, data input and initialization take a big part of the runtime.
However, with the increase of particle number, there total percentage decreases dramatically. It drops from 87.45 to 31.5% when the particle number increases from 0.4M to 10M. If there are more timesteps in a simulation, its percentage will decrease more.
In a timestep, the runtimes for SNP, PI and SU increase with the number of particles linearly approximately. They are also much smaller than those of codes run on host. For instance, when particle number is 0.4M, the runtime of host is 22 times larger than that of device, which are 3186ms and 144.6ms respectively. However, the particle interaction part takes a large part of the time in a timestep. As is shown in Table.5, particle interaction takes about 80% of all computing time of a timestep, though the proportion drops slightly with the increase of particle number. Searching neighbor particles only involves read data from memory without calculating. On the contrary, particle interaction needs to calculate a lot of data, such as particle distance, velocity and pressure. GPU's calculation capacity is much inferior to CPU's and dealing with data which are double precision consumes more time than that of single precision. As is mentioned before, all data are stored in GPU's global memory in double precision to achieve a higher calculating precision. As a consequence, particle interaction takes much more time than searching neighbor particles and system update.  Table.6 and Figure.14 summarize the computational times and speedups in three different systems. Note that all the simulations in CPU are single-thread. As is shown in Figure.10, a significant speed up can be achieved by using GPU parallel computing. When the particle number is relatively small, the speedup increases with it dramatically. However, when the particle number is relatively large, the speedup is approximately constant. The speedups vary with different systems.
As a professional scientific computing GPU, Tesla K20C can achieve good performances. For example, it can accelerate the speed of computing to 83 times when particle number is 5M. The number of particles that it can simulate is massive. For instance, it can simulate a case of 50M particles, which only takes 15s. GeForce GT730 and Quadro M4000 also show good performances, though they cannot match with Tesla K20C. In short, using GPU paralleled computing can accelerate the solving of wave equations effectively.

Conclusions
A parallel implementation of SPH for solving wave equations has been developed in this paper.
The code is paralleled utilizing CUDA technique and is able to carry out acoustic problems with a large number of particles. The major contribution of this paper can be summarized as following: (1) Particle-based computational acoustics is paralleled on GPU. The feasibility of this method is verified and it can achieve a desirable accuracy. All the main procedures in a simulations such as initialization, searching neighbor particles, particle interaction and system update are paralleled in GPU, which can save a lot of runtime compared with codes run in CPU.
(2) Two different neighbor particle searching methods are developed. Their efficiencies are discussed by comparing their runtimes. The result shows that linked-list search method can be hundreds of times faster than all-pair search method in some cases.
(3) The percentages of time devoted to different procedures are analyzed to find the most time- consuming procedures. The result shows that searching neighbor particles and particle interaction take more 90% of time both in CPU and GPU.
(4) The performances of different calculating platforms which vary in CPU and GPU card are discussed. Tesla K20C is the most suitable card for paralleled calculating. It can simulate more than 50M particles in less than 15s in a timestep. When the particle number reaches 2.5M, it can get a speedup of 83.2x.