High-throughput of measure-preserving integrators for constant temperature molecular dynamics simulations on GPUs

Molecular dynamics simulation is currently the theoretical technique eligible to simulate a wide range of systems from soft condensed matter to biological systems. However, of the excellent results that the technique has arrogated, this approach remains computationally expensive, but with the emergence of the new supercomputing technologies bases on graphics processing units graphical processing units-based systems GPUs, the perspective has changed. The GPUs allow performing large and complex simulations at a significantly reduced time. In this work, we present recent innovations in the acceleration of molecular dynamics in GPUs to simulate non-Hamiltonian systems. In particular, we show the performance of measure-preserving geometric integrator in the canonical ensemble, that is, at constant temperature. We provide a validation and performance evaluation of the code by calculating the thermodynamic properties of a Lennard-Jones fluid. Our results are in excellent agreement with reported data reported from literature, which were calculated with CPUs. The scope and limitations for performing simulations of high-throughput MD under rigorous statistical thermodynamics in the canonical ensemble are discussed and analyzed.


Introduction
The accelerated use of molecular simulation methods, such as molecular dynamics (MD) simulations, has motivated an increasing interest in the implementation of rigorous algorithms in modern supercomputing technologies, which is a difficult task.Due to the purely numerical nature of MD method, which emerge from the laws of statistical mechanics, they require the use of efficient schemes to sample the phase space, maintaining control of thermodynamic variables at large scales of time and length.The combination of powerful computational technologies and the strict laws of statistical mechanics placed in a same simulation code is a great challenge nowadays.spaces or thermodynamic ensembles are characterized according to the type of physical variables that are to be evaluated.For example, the simplest case is where it is required to keep the energy constant, this case is the so-called microcanonical ensemble (NVE), where the thermodynamic condition is to maintain the energy constant (E), as well as the volume (V) and number of particles (N).However, for comparations with experimental conditions, it is often convenient perform simulations at constant temperature, which leads us to have a system in the canonical ensemble (NVT).In this case, the thermodynamic condition is to maintain the temperature constant (T), as well as the volume (V) and number of particles (N).The NVT ensemble is somewhat more difficult to generate than the NVE ensembles due to the requirement that the kinetic energy fluctuations must generate the correct distribution function corresponding to the ensemble.Temperature control algorithms comprise one of the fundamental parts in MD simulations.
Generally, when the Newtonian MD scheme it is modified to maintain the temperature constant is through of the tools called a thermostat algorithms (TAs).An efficient thermostat must meet certain requirements required of the statistical mechanics.For example, give a canonical distribution of speeds, and be ergodic, etc.Not only averages, but also fluctuations (dispersions) away from the average are also important.Several strategies to improve simulations at the canonical ensemble have been suggested in the literature [1][2][3][4][5].The simplest method is through the periodic scaling of velocities; this method is economically and numerically stable.However, it does not guarantee that a canonical space distribution is obtained, but they are still commonly used.The lack of ergodicity for some thermostat algorithms or for different temperature ranges can lead to a dynamic that does not correctly reproduce the expected thermodynamic properties [6][7][8][9].
Simulating a thermal bath is not a simple task.To achieve an accurate and efficient integration in the constant-temperature ensemble, additional variables may be added to create a "thermostat" that controls the temperature of the simulated system [10].From this idea arise the so-called extended Hamiltonian methods or Non-Hamiltonian methods [11][12][13][14].These variables or extended degrees of freedom regulate the time-averaged values of temperature in order to reproduce as accurately as possible the phase space in the considered ensemble.However, the use of this type of methods, although are accurate, are should be used with care, for example, the mass of extended variables must be carefully chosen as they affect spontaneous fluctuations in the system.Lippert et.al exposed computational challenges that involve controlling the temperature of a system using external degrees of freedom [15].Two fundamental observations were made.One of them was the high computational cost that carries the instantaneous updates of extended degrees of freedom.Second, observation to the wide separation between the timescales associated with the extended degrees of freedom and those associated with particle motion.The extended degrees of freedom generally evolve on timescales much longer than those of particle motion; this can result in numerical inaccuracy if the numerical precision employed is insufficient.
Several TAs has been development to make use of extend phase space of the system of interest by adding an extra dimension that takes into account the interaction of the system with the environment.The most popular TAs that is based on the extended phase space is the Berendsen [16], Andersen [17] and Nosé-Hoover (NH) methods [18,19].Undoubtedly, the NH method is currently the most popular isothermal simulation method, however it has a drawback that their equations are non-Hamiltonian in structure due to the non-canonical coordinated transformation, which precludes the use of symplectic integration schemes [20], NH thermostat has ergodicity problem.The failure of this approach under the conditions usually used in molecular dynamics calculations were clarified by Martyna et al. [21].They proposed that the shortcomings of the NH algorithm could be overcome by connecting a chain of thermostats to construct a NH chain (NHC).The figure 1 schematically shows the construction of the NHC method.That is, the first Nosé-Hoover thermostat (1 st NHT) in the chain is coupled to the physical system, the second (2 nd NHT) is coupled to the first Nosé-Hoover, the third (3 rd NHT) is coupled to the second thermostat, the (4 th NHT) is coupled to the third thermostat Nosé-Hoover, and so on, until forming a collective system of Nosé-Hoover chains (NHC).
When using non-Hamiltonian equations of motion, such as NHC equations, the numerical integrator should be consistent with a non-Hamiltonian generalization of Liouville's theorem [22,23].That mean that the integrator must be symplectic, that is, must conserve exactly the invariant phase-space volume, must have an approximate energy conservation, and time reversibility between other properties [24].A numerical integrator that achieves this is known as a "measurepreserving" algorithm.These numerical algorithms are based on decompositions of exponential operators, and the error in the total energy of the system is bounded [25].Also, these algorithms represent the explicit integration of non-Hamiltonian dynamical systems.So far, Tuckerman et al. [26] have developed a consistent classical statistical theory of certain non-Hamiltonian dynamical systems and have developed a "systematic way" of designing equations of motion of extended molecular dynamics which are known as MTK equations, from Martyna, Tobias and Klein authors.
MTK algorithms are already applicable to a wide range of scientific problems and have been expanding their applications in several areas, such as: Physical Chemistry [27], Theoretical and Computational Chemistry [28], Materials Engineering [29], Macromolecular and Materials Chemistry [30], Biochemistry [31], Cellular and Cell Biology science [32], and so on.The implementation of the MTK algorithms is not trivial and its evaluation is computationally expensive.So far, major MD codes such as LAMMPS [33], AMBER [34], HOOMD-blue [35] and GROMACS [36] have implemented the MTK algorithms, being LAMMPS the most used code with these algorithms.But nevertheless, these codes were designed and work in programming environments in CPUs.On the other hand, with the advent of graphics processing units (GPUs), the field of High-throughput MD (HTMD) has become a fundamental tool for pharmaceutical research [37], accelerate the innovation in materials research [38], in revolutionizing the genome-wide [39], to explore large effects of structural changes ensemble of proteins [40], for drug discovery [41], just to mention a few applications.However, these areas have been developed using huge and expensive CPUS clusters.Unfortunately, the generation of HTMD codes designed exclusively for perform molecular simulations on GPUs is very scarce.ACEMD [42] is the only code of MD specially optimized to run on GPUs and it is one of the world's fastest molecular dynamics engines.Nevertheless, ACEMD works with Langevin type thermostats, and to date, has not yet implemented the MTK algorithms.
The state of the art of HTMD reports that little effort has been directed to the efficient implementation of MTK algorithms on GPUs.The motivation for the use of integrators that explicitly preserve the measurement of the space phase arises as a need to have an efficient tool that guarantees ergodic sampling at large scales of time and length, this is only possible today using GPUs.In this work, we show the performance of measure-preserving geometric integrator in the canonical ensemble on GPUs, under the scheme of the MTK algorithms, which are the basis of the NHC thermostat.The capacity of our HIMD code can simulate more than one million Lennar-Jones type point particles achieving an average of 0.27 ns/day.In smaller systems, of the order of 30,000 particles, we achieve a production of 30 ns/day, it should be noted that most of the simulation results reported today are of this number of atoms, but with a production of ns/day of half or less than half.So far, no code with those capabilities has been reported using NHC thermostat.

Methodology
It is well-known that a symplectic integrator conserves the Hamiltonian and therefore achieves stable integration over a long time.This is because there exists a value that is exactly conserved by the approximated propagator.These numerical methods are based on the decompositions of exponential operators.A system that preserve the generalized phase-space metric, it will be consistent with the generalization of Liouville's theorem.The details of Liouville operator algorithms derivation for non-Hamiltonian systems, such as NHC thermostat, can be found in original papers sources [43][44][45][46], therefore we shall only outline the main NHC equations.
In order to generate a sampling of the canonical distribution (NVT), Hamilton's equations must be supplemented by a mechanism that allows the system to exchange energy with its surroundings.One popular method for mimicking the influence of the surroundings in MD is so called extended phase space approach.As already mentioned in the Introduction section, Nosé-Hoover method is one of the best schemes to generate a canonical distribution but has ergodicity problems.The failure of this approach was solved by Martyna et al. introducing the NCH thermostat [21].The main idea behind the NCH thermostat is that the physical position and momentum variables of the particles in the system are coupled to additional phase space variables that mimic the effect of the surroundings by controlling the fluctuations in the instantaneous kinetic energy.
The NHC thermostat [21] is a non-Hamiltonian MD scheme for generating the canonical ensemble.In this method, the ordinary phase space is extended to include a set of M thermostat variables  1 , …   and their conjugate momenta   1 , …    , which serve to drive the fluctuations of the kinetic energy in such a way that they average to the proper canonical value.These variables act as a heat bath coupled to the system.The equations of motion for an NHC system are The parameter  1 , …   are mass-like parameters (having units of energy*time 2 ) that determine the time scale on which the heat-bath variables evolve, and   is the Boltzmann constant.The term −( 1 / 1 )  in the momentum equation acts as a kind of dynamic frictional force.Although the average 〈 1 〉 = 0, instantaneously,  1 can be positive or negative and can, therefore, act to damp or boost the momentum.According to the equation for  1 , if the kinetic energy is larger than 3  /2,  1 , it will increase and have a greater damping effect on the momenta, while if the kinetic energy is less than 3  /2,  1 will decrease, become negative, and have a boosting effect.In this way, the NHC system acts as a "thermostat" regulating the kinetic energy.In a similar manner, the ( + 1)th heat-bath variable serves to modulate the fluctuations of the kth variable so that each heatbath variable (except the Mth variable) is "driven" to have a proper canonical average.Equations ( 1 where (, ) is the Hamiltonian of the physical system.When  = 1, the NHC system reduces to the simpler Nosé-Hoover system [19], which does not generate the corresponding canonical distribution.

Code validation
In this section, we show the results of an implementation of NHC thermostat on GPUs.First, we performed the validation of the code simulating a system composed of 864 particles of a Lennar-Jones (LJ) fluid.A comparison with the data of the literature is included, we show that the equilibrium properties are the same for the simulation time considered here, but our data are calculated in GPUs while those of literature are calculated in CPUs.The details of the simulations presented can be found in appendix A of this manuscript.
The Figure 2 shows the equilibrium properties of LJ fluid in bulk phase.The comparison between literature data (calculations in CPUs) from Johnson et al. [47] and simulation data from this work (calculations in GPUs) is presented.Figure 2a shows the behavior of the potential energy as a function (U * ) of density ( * ) for a wide range of temperatures, from T = 2.0 to T = 6.0.These values are in reduced units.In our case, the error bars are included.Figure 2b shows the behavior of the pressure (P * ) as a function of density ( * ) for a wide range of temperatures, from T = 2.0 to T = 6.0.All values are in reduced units.We can see that the agreement between the data of the literature calculated in CPUs and the data calculated in this work in GPUs is excellent.This is a test of the excellent implementation of the NHC algorithm in GPUs.These results show that the canonical ensemble given by the NHC algorithm correctly produces thermodynamic properties in equilibrium, performed in GPUs as is showed in this work.For a thermostat to be entirely consistent with the canonical ensemble, it should generate total energies according to the Boltzmann distribution for that system and generate kinetic energies consistent with the Maxwell−Boltzmann distribution [23].To verify this rule, first we evaluate the evolution of the conserved quantity () given by equation ( 3 , where i runs over the number of configurations   .  is the total energy of the whole system including the atoms and thermostat variables.No drift in the conserved quantity was observed in any of the simulations as we expected.Fig. 3B shows a symmetric density distribution for this system.The NHC motion equations increase the calculation time and the complexity due to the coupling of the thermostats; this causes the ergodicity of the dynamics to increase when increasing the available phase space of the dynamics.One of the main goals of using GPUs in MD simulations is to break with the time/length scales, that is, to simulate larger systems and longer times, preserving the correct application of the laws of statistical mechanics.To show the effectiveness of the NHC algorithm in the canonical assembly, we analyze the evolution of the temperature reduce ( T * ), the kinetic energy reduced ( K * ), as well as the distribution of the kinetic energy reduce ( K * ).The results show in the figure 4. We can see that ( K * ) obeys to the correct distribution dictated by the canonical assembly [25].This result shows a good agreement with similar results in the evaluation of geometrical properties that are applied to the algorithms designed to maintain the temperature in the canonical ensemble [3].It is important to emphasize the conservation of temperature, which shows any no drift.The calculations on CPUS presented in this section were performed with an inhouse code.(c) The numerical distribution reduced kinetic energy ( K * ) for the same system is showed.The details of these simulations can be found in the Appendix A. All thermodynamics are reported in dimensionless units.

Code performance
After showing the excellent comparison of thermodynamic properties in equilibrium obtained with both CPUs and GPUs for a LJ fluid in phase bulk, we will show below the performance of our code in different GPUs architectures.The performance benchmark test is based on LJ system also.These benchmarking simulations were run on our cluster Olinka [48], which is equipped with the latest GPUs technologies.As we mentioned before, the true power of using GPUs is to simulate systems composed of a large number of atoms, as clusters based on GPUs technologies allow.
To know some advantages like high efficiency, high speed and low cost of the NHC thermostat on GPUs, the execution time per step of MD and the speedup were measure, both parameters as a function of the number of particles.The result is shown in figure 5. We observe from Figure 5A, that the computational cost of time per step increases exponentially in the CPUs.As the number of particles increases, the time per step also increases.In the GPUs, the computational cost of time per step is practically constant.Speedup in Figure 5B is measured as the ratio of wall time elapsed for carrying out a specific simulation.In this case, we observe that for a system of 4000 particles a speedup of 45 is achieved, while a system of 23328 achieves a speedup of 60, therefore, the speedup achieved by going from a system of 4000 particles to a system of 23328 is of the order of 25.A speedup of this magnitude is significant because of the wide applications of MD simulations to biological systems.These tests were measured in the GPU model GTX-1080 for systems of LJ particles.).In Figure 6, a complete analysis of the performance of NHC thermostat is shown.Figure 6 shows that the best performance of NHC thermostat is achieved with GTX 1080 GPU cards, achieving a production of 62 ns/day.It has been documented that the best MD codes running on GPUs reach their best performance on these cards [42].This performance is achieved in single GPUs for a system of 10976 particles type LJ.Currently, our code can simulate up to 1.2 million particles.The benchmarking production is presented in the Figure 7. Simulating more than one million particles is a challenge that keeps many research groups busy.This size of systems is the maximum that a GPUs supports, due to the capacity of the memory.GPUs with greater memory capacity could simulate larger systems, and of course with a higher ns/day production.Figure 7A shows the decay of ns / day production as the particle number increases.This behavior is normal due to the cost of calculating the forces.The maximum size that we manage to simulate is about of one 1.2 million of LJ particles, obtaining only 0.27 ns/day on GPUs model GTX-1080, so it is seen more clearly in figure 7B. Figure 7C shows an example of the configuration of this system, which measures approximately 40 nm.That order of magnitude is of great importance where the capabilities of the HTMD are exploited.

Discussion
The key contribution of HTMD molecular simulations technologies is to accelerate the innovation in research both simple systems and complex systems.We think that new computational algorithms and strong collaborations between chemists, physicists, biologists, mathematicians, engineers and medical professionals will reinforce this area of research.However, the problems and challenges related to biomolecules, which are often expensive in the sense of numerical simulation, open a research window with a major difficult to answer.A question that has not been fully addressed to date is to what extent the kinetic properties can be reproduced correctly, this using MD to see extent simulations can be accelerated by these simulations methods [49].
Here, we present a new tool for enabling HTMD simulation under a strict methodology based on the NHC method.The accurate simulation in the canonical ensemble is a problem of great scientific importance.With this software we would like to perform simulations in order to be able to test predictions that serve as a point of reference in real experiments, in such a way that we can accelerate promising results.With molecular simulation methods, one is normally obliged to comply with the laws of statistical mechanics, describing with precision the statistical ensemble under study, in this case the canonical ensemble.Here we show how to combine the most advanced supercomputing technologies in conjunction with the highest technologies that generate integrators that preserve the measurements of the phase space for the canonical set at a very low computational cost.We also show the potential of GPUs that allow us to explicitly evaluate with enough precision the simulation of simple systems under the scheme of the NHC thermostat, thus exceeding the exact temperature modeling at large scales with respect to other temperature control algorithms.Accelerated molecular dynamics on GPUs as provided by our HTMD platform should be of wide interest for a large number of computational scientists as it provides performance comparable to that achievable on standard CPU supercomputers in a laboratory environment.Even research groups that have access to High-Performance Computing can find a useful tool in our code, which has the ability to run simulations locally for longer periods of time and with greater flexibility.Finally, we see that the race to have more and better MD codes in GPUS is in constant growth, and we need more innovation in this sense.

Conclusions
The evolution and innovation in the development MD software on GPUs is in constant expansion.However, the use of GPUs acceleration of condensed-phase matter MD simulations is still in its infancy.The pressure to achieve maximum performance has led to the use of approximations in statistical methodologies trying to avoid a real rigorous validation.The development of MD codes in GPUs seems to be an established and extremely active field, however, very few codes can be considered ready for production and even very few achieve the desired goal of making direct comparisons with real experiments, without making approximations.However, the current benefits of GPUs are enticing, and this is driving both code and hardware development.Despite the substantial progress made in the development of the code, the difficulty in programming GPUs devices persists, programming complex algorithms such as the NHC thermostat makes some groups choose to implement simpler thermostats, which are efficient at long times, example explicit are the NH and Langevin thermostats.With constant anticipated release of new technology NVIDIA will undoubtedly bring more competition in the development of more efficient software with more demanding implementations such as those presented in this work.It is anticipated that with the release of new versions of GPUs, MD codes will evolve rapidly, while research with these methodologies will increase exponentially in the coming years, but the limitation of implementing complex methodologies such as those presented in this paper is a latent challenge.

Figure 1 .
Figure 1.Schematic representation of the NHC method and its operating.The temperature of the physical system (red box) is controlled by the mechanism of the NHC equations (all four green boxes).That is, the first Nosé-Hoover thermostat (1 st NHT) in the chain is coupled to the physical system, the second (2 nd NHT) is coupled to the first Nosé-Hoover, the third (3 rd NHT) is coupled to the second thermostat, the (4 th NHT) is coupled to the third thermostat Nosé-Hoover, and so on, until forming a collective system of Nosé-Hoover chains (NHC).

Figure 2 .
Figure 2. Equilibrium properties of Lennard-Jones fluid in bulk phase: (a) Potential energy vs. density.The blue line corresponds to literature data from Johnson et al. [47]., which were calculated in CPUS, while the red dotted line are simulation data from this work but calculated in GPUs.The symbol of the error bars is included in our calculations; (b) Pressure vs. density.The continuous lines correspond to the data reported by Johnson et al. [47], while the data with symbols correspond to data obtained in this work.All units are dimensionless.

Figure 3 .
Figure 3. (a) Evolution of conserved quantity () as a function of simulation time (ps).The red line corresponds to average energy while the black line corresponds to instant energy; (b) Histogram of cumulative momentum distribution obtained from a trajectory calculated using NHC method.For this case T * = 2.0,  * = 0.7, and  = 864.

Figure 4 .
Figure 4. Sampling of temperature between CPUS and GPUs platforms; (a) Evolution of the reduced temperature ( T * ) of the system at equilibrium; (b) The behavior of the reduced kinetic energy ( K * );(c) The numerical distribution reduced kinetic energy ( K * ) for the same system is showed.The details of these simulations can be found in the Appendix A. All thermodynamics are reported in dimensionless units.

Figure 5 .
Figure 5. Performance of NHC thermostat on GPUS: (a) Performance CPU vs. GPU of time per step as a function of particles number; (b) Speedup of NHC thermostat as a function of particles number.These tests were measured in the GPUs model GTX-1080 for a system of 4000, 6912, 10976, 16384 and 23328 particles.The Speedup obtained by migrate the NHC algorithms to GPUs is approach 60 times faster than the CPU version.

Figure 6 .
Figure 6.Performance Evaluation and Benchmarking of NHC thermostat in different GPUs architectures.The performance of our code is tested on a system of constituted by 10976 Lennard-Jones type particles.

Figure 7 .
Figure 7. HTMD of NHC thermostat performance on GPUs; (a) Maximum performance on a GPUs GTX-1080; (b) Comparations of production in ns/day between two GPUs architectures to simulate one million of LJ particles; (c) A snapshot of a system composed by one million of particles.