5. Experimental Results
Since tumor growth simulations require a high resolution, we have chosen a 1024 × 1024 grid to represent the tumor tissue. This grid size limits the tumor to 1,048,576 cells, which are empty (non-cancerous tissue) at the beginning of the simulation and can be progressively occupied by tumor cells. A grid of this size provides sufficient spatial resolution for meaningful biological modeling, so that tumor progression, diffusion, and cellular behavior can be simulated accurately. In addition, modern GPUs have multiple streaming multiprocessors (SMs) that can simultaneously process multiple blocks of 1024 threads, which allows the GPU cores to be kept busy across all SMs.
In our model, each thread processes one cell, represented by a grid point. Because the grid is stored as a 2D array in row-major order, memory access is efficient for three reasons: global memory reads and writes are coalesced, memory loads are sequential, and cache locality improves because each warp processes a contiguous block of memory.
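The thread-to-cell mapping described above can be sketched as follows. This is a minimal illustration rather than the kernel used in this work: the square 1024 × 1024 layout (consistent with the 1,048,576 cells stated above), the 4-byte cell size, and the function names are all assumptions.

```python
# Hypothetical sketch of a row-major thread-to-cell mapping.
GRID_WIDTH = 1024   # cells per row (assumed square grid of 1,048,576 cells)
BLOCK_DIM = 1024    # threads per block, as mentioned in the text
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
CELL_BYTES = 4      # assumed size of one cell state

def cell_of_thread(block_idx, thread_idx):
    """Map a 1D global thread id to (row, col) in a row-major grid."""
    gid = block_idx * BLOCK_DIM + thread_idx
    return gid // GRID_WIDTH, gid % GRID_WIDTH

def byte_offset(row, col):
    """Byte offset of a cell inside the flat row-major array."""
    return (row * GRID_WIDTH + col) * CELL_BYTES

def warp_offsets(block_idx, warp_idx):
    """Byte offsets touched by the 32 threads of one warp."""
    first = warp_idx * WARP_SIZE
    return [byte_offset(*cell_of_thread(block_idx, first + lane))
            for lane in range(WARP_SIZE)]

offsets = warp_offsets(block_idx=3, warp_idx=5)
# Consecutive lanes touch consecutive cells, so the accesses can be coalesced.
is_contiguous = all(b - a == CELL_BYTES for a, b in zip(offsets, offsets[1:]))
```

Under these assumptions, the 32 threads of a warp read one contiguous 128-byte span, which is the pattern that allows the hardware to merge the loads.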
5.1. Performance Comparison
The performance of the proposed tumor growth simulation was evaluated using various execution parameters and CUDA grid configurations. The total execution time and the number of processed cells per time unit were measured on the full simulation grid over 150 simulation days. A comparison with previous work [11] highlights three key points: (a) the proposed algorithm processes a similar number of cells as in [11], demonstrating that the tumor growth model remains consistent; (b) unlike [11], which used 4000 steps, this work uses 3600 steps to match the 150-day simulation time window, where each day consists of 24 steps (150 × 24 = 3600 steps); and (c) although the best execution times of the proposed approach are slightly worse than those in [11], the grid cell distribution strategy enhances scalability, which was not achieved in the previous work.
5.1.1. CUDA Grid Size and Speedup Analysis
The speedup achieved when varying the CUDA grid size is detailed in Table 1. The results show a significant speedup compared to single-thread execution.
For small CUDA grid sizes, speedup is limited, but as the grid size increases, acceleration improves significantly. The best speedup is observed at the CUDA grid size reported in Table 1, where the simulation reaches its highest throughput (processed cells/s) and its shortest execution time.
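For reference, speedup relative to single-thread execution and throughput in processed cells/s can be computed as in the sketch below. The timings are placeholders chosen for illustration only; they are not measurements from this work, and the function names are hypothetical.

```python
def speedup(t_single_thread, t_parallel):
    """Ratio of single-thread time to parallel time (same workload)."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_single_thread / t_parallel

def throughput(cells_processed, seconds):
    """Processed cells per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return cells_processed / seconds

# Placeholder timings in seconds, NOT results from the paper.
t_single = 100.0
t_by_cuda_grid = {16: 12.5, 64: 4.0}   # keyed by number of blocks

speedups = {g: speedup(t_single, t) for g, t in t_by_cuda_grid.items()}
best_grid = max(speedups, key=speedups.get)
```

Because the workload is the same in every configuration, the configuration with the best speedup is also the one with the shortest execution time and the highest throughput.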
5.1.2. Execution Time Curve Analysis
The relationship between the number of tumor cells and the execution time per step is crucial. Ideally, this relationship should be linear or sub-linear, indicating that each thread runs independently. However, Figure 3a shows a slight curve, suggesting that cell interactions affect execution and that neighboring cells increase the computational load.
Similarly, Figure 3b shows that total execution time initially increases rapidly as the tumor grows, but growth eventually stabilizes and follows an approximately linear trend. This indicates that the probabilistic model remains efficient even for larger simulations.
5.1.3. Kernel Execution Analysis
Kernel execution times and GPU memory usage were analyzed using the nvprof and Nsight Compute tools. The results in Table 2 show that the most time-consuming processes are: (a) cell state updates and calculations (highest execution time); (b) memory transfers between GPU and CPU (every 5 simulation days); and (c) grid initialization and random number generation.
Regarding the occupancy analysis, Figure 4a shows that the initial kernel execution produces high occupancy, as most threads execute similar tasks. Over time, occupancy drops to 36.28% (Figure 4b), indicating an imbalanced workload distribution. This suggests that the distribution of work to the SMs could be further optimized to limit the drop in occupancy and enhance overall efficiency.
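As a reminder of what the occupancy figures express, achieved occupancy is the average number of active warps on an SM divided by the maximum number of warps the SM can hold. The sketch below illustrates the ratio; the per-SM limit of 64 warps and the sample values are assumptions, not profiler output from this work.

```python
MAX_WARPS_PER_SM = 64   # assumed hardware limit; depends on the GPU architecture

def achieved_occupancy(active_warp_samples, max_warps=MAX_WARPS_PER_SM):
    """Mean number of active warps over the samples, relative to the limit."""
    if not active_warp_samples:
        raise ValueError("at least one sample is required")
    mean_active = sum(active_warp_samples) / len(active_warp_samples)
    return mean_active / max_warps

# Placeholder samples: many active warps early, fewer once work is uneven.
early = achieved_occupancy([64, 60, 62, 58])
late = achieved_occupancy([30, 24, 20, 18])
```

A drop from the first value to the second corresponds to warps finishing early on lightly loaded SMs while others are still busy.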
5.2. Scalability Analysis
When designing a parallel GPU implementation of tumor growth simulations, the way the grid is partitioned has a major impact on SM occupancy, memory access patterns, and load balancing. In this research, we explored two main approaches: (a) dividing the grid into 16 full rows or columns, and (b) dividing the grid into 64 smaller regions, in order to evaluate the performance of each partitioning on the GPU and to analyze SM occupancy and the distribution of cell processing. A hybrid strategy (c) for optimized GPU utilization and memory efficiency is also discussed.
5.2.1. Dividing the Grid into 16 Rows and Columns
The grid is split into 16 rows and columns, with each block processing an entire row or column. This makes it easy to allocate work to different GPU blocks. The grid is stored in row-major order, so threads in a warp access contiguous memory locations, which makes memory accesses faster and reduces memory latency. However, Figure 5 shows severe SM under-utilization: some SMs exhibit minimal activity, with only 6.3 million cycles, while others are significantly overloaded. The tumor grows from the center of the grid outward, so the blocks responsible for the outer edges of the grid finish their work sooner than those processing the center, leaving some SMs idle. The first few SMs process the center, which is much denser (high workload), while the other SMs process sparse areas (low workload), leading to under-utilization of the GPU cores.
Figure 5 also shows much lower L1/L2 cache utilization, so a larger share of memory accesses is directed to DRAM instead of being served from cache. DRAM activity is lower as well, but this is probably a consequence of SM under-utilization rather than an improvement in efficiency.
It is difficult to distribute work dynamically with this grid division strategy because the execution time of each block depends on the tumor density in the row or column assigned to it. As a result, some blocks may finish while others still have substantial work left.
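The imbalance can be illustrated with a toy model. The sketch below assumes a disc-shaped tumor centered in a 1024 × 1024 grid and counts tumor cells in each of 16 contiguous bands of rows; the radius, the disc shape, and the use of tumor cell counts as a proxy for work are illustrative assumptions, not the model used in this work.

```python
from math import isqrt

N = 1024        # grid side; 1024 * 1024 = 1,048,576 cells (assumed square)
BANDS = 16      # one contiguous band of rows per thread block
RADIUS = 200    # hypothetical tumor radius, in cells

def tumor_cells_in_row(row, radius=RADIUS):
    """Count cells of a centered disc-shaped tumor that fall in one row."""
    dy = row - N // 2
    if abs(dy) > radius:
        return 0
    return 2 * isqrt(radius * radius - dy * dy) + 1

def work_per_band(bands=BANDS):
    """Sum tumor cells over each contiguous band of N // bands rows."""
    rows_per_band = N // bands
    work = [0] * bands
    for row in range(N):
        work[row // rows_per_band] += tumor_cells_in_row(row)
    return work

work = work_per_band()
idle_bands = sum(1 for w in work if w == 0)
# Ratio of the busiest band to the average; 1.0 would be perfectly balanced.
imbalance = max(work) / (sum(work) / len(work))
```

With these assumed values, half of the 16 bands contain no tumor cells at all, while the busiest band carries more than twice the average load.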
5.2.2. Dividing the Grid into 64 Smaller Regions
In this case, instead of assigning an entire row or column to one block, the grid is split into 64 smaller parts. Each block of threads then processes portions of rows or columns spread across these parts. As shown by the CUDA Toolkit profiling results in Figure 6, SM utilization is much more balanced than in Figure 5 because all SMs fall within a narrow range of execution cycles (between 428 million and 443 million).
Dividing the grid into 64 regions is an effective way to improve GPU utilization, as it ensures that no SM is inactive. This approach makes it easier to balance the workload across the SMs, as shown in Figure 6. Since each block now processes small groups of cells spread across the grid, the workload is distributed more evenly. This strategy suits the tumor growth pattern (from the center of the grid to the edges) and is therefore less likely to create an imbalance, because each SM receives work from multiple regions. In addition, since all SMs process similar amounts of work per iteration, fewer SMs sit idle, which keeps the GPU well utilized.
Figure 6 also shows that the 64-region division makes better use of the L1/L2 caches, which reduces global memory access time. Frequently accessed data is more likely to stay in the cache, which reduces DRAM bottlenecks and improves overall memory efficiency. The figure also shows higher DRAM cycles, which, together with the larger number of active SMs, indicates that more of the available memory bandwidth is being used for computation, so memory bandwidth is used efficiently. In contrast, dividing the grid into 16 rows/columns leads to inefficient GPU execution due to idle or underloaded SMs.
However, this strategy requires a more sophisticated indexing scheme to allocate blocks and ensure that work is distributed evenly across SMs, which adds scheduling complexity in the SM schedulers.
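One possible indexing scheme of this kind is sketched below. It assumes 8 × 8 = 64 square regions, 16 thread blocks, and a round-robin assignment of the rows inside every region to the blocks; the disc-shaped tumor and all names are hypothetical and serve only to show why spreading each block over all regions evens out the load.

```python
from math import isqrt

N = 1024                      # grid side (assumed square grid)
TILES_PER_SIDE = 8            # 8 x 8 = 64 regions
TILE = N // TILES_PER_SIDE    # 128 cells per region side
BLOCKS = 16                   # hypothetical number of thread blocks
RADIUS = 200                  # hypothetical tumor radius, in cells

def region_of(row, col):
    """Index (0..63) of the region that contains a cell."""
    return (row // TILE) * TILES_PER_SIDE + (col // TILE)

def owner_block(row, col):
    """Deal the rows inside every region round-robin to the blocks.
    The column is unused here because whole row segments are assigned."""
    return (row % TILE) % BLOCKS

def tumor_cells_in_row(row, radius=RADIUS):
    """Count cells of a centered disc-shaped tumor that fall in one row."""
    dy = row - N // 2
    if abs(dy) > radius:
        return 0
    return 2 * isqrt(radius * radius - dy * dy) + 1

work = [0] * BLOCKS
for row in range(N):
    work[owner_block(row, 0)] += tumor_cells_in_row(row)

# Number of distinct regions from which each block receives cells.
regions_per_block = [
    len({region_of(r, c) for r in range(N) for c in range(0, N, TILE)
         if owner_block(r, c) == b})
    for b in range(BLOCKS)
]
imbalance = max(work) / (sum(work) / len(work))
```

In this toy model every block receives cells from all 64 regions, and the busiest block stays within a few percent of the average load, at the cost of the extra index arithmetic in `owner_block`.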
5.2.3. Hybrid Strategy
Instead of strictly dividing the grid into full rows or small regions, we combine both techniques to balance the memory access efficiency of the 16-row strategy with the load balancing of the 64-region strategy. This hybrid approach reduces memory contention while maintaining high SM occupancy. The key components of the hybrid strategy are:
Hierarchical grid partitioning. Each thread block is assigned a set of small contiguous row segments rather than full rows/columns or scattered regions, improving spatial locality.
Coalesced memory accesses. Threads within a block process contiguous memory regions to enhance memory efficiency.
Dynamic load balancing. Thread blocks dynamically adapt to workloads from different regions, preventing idle SMs and ensuring even resource utilization.
The hybrid approach aims for an even workload distribution across all SMs. Unlike the 16-row division, where some SMs remain underutilized (as low as 6.3 million cycles), the hybrid method prevents imbalances by dynamically redistributing the workload. It also outperforms the 64-region strategy by maintaining a more uniform number of SM active cycles, leading to near-uniform GPU utilization (Table 3).
Additionally, this approach maintains contiguous memory accesses, preventing excessive DRAM fetches and improving overall memory efficiency. Because memory access patterns are more localized, the L1 and L2 caches achieve higher hit rates, reducing the need for costly global memory transactions. Compared to the 64-region strategy, which may cause scattered memory accesses and cache thrashing, the hybrid strategy therefore yields lower memory latency and improved overall performance.
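The combination of contiguous row segments and dynamic load balancing can be sketched as a simple work queue. The model below is an assumption about how such a scheme might behave, not the implementation used in this work: the segment length, the number of workers, the cost model (one unit per segment plus one per tumor cell), and the disc-shaped tumor are all hypothetical.

```python
import heapq
from math import isqrt

N = 1024         # grid side (assumed square grid)
SEGMENT = 256    # contiguous cells per work item (keeps accesses coalesced)
WORKERS = 16     # hypothetical number of concurrently running blocks
RADIUS = 200     # hypothetical tumor radius, in cells

def tumor_cells_in_segment(row, col_start, length=SEGMENT, radius=RADIUS):
    """Tumor cells in columns [col_start, col_start + length) of one row."""
    center = N // 2
    dy = row - center
    if abs(dy) > radius:
        return 0
    half = isqrt(radius * radius - dy * dy)
    lo = max(col_start, center - half)
    hi = min(col_start + length - 1, center + half)
    return max(0, hi - lo + 1)

def schedule(segments, workers=WORKERS):
    """Greedy dynamic assignment: the next segment goes to whichever
    worker becomes free first. Returns the total load of each worker."""
    finish = [(0, w) for w in range(workers)]
    heapq.heapify(finish)
    for row, col in segments:
        busy_until, w = heapq.heappop(finish)
        cost = 1 + tumor_cells_in_segment(row, col)  # assumed cost model
        heapq.heappush(finish, (busy_until + cost, w))
    return [load for load, _ in finish]

segments = [(r, c) for r in range(N) for c in range(0, N, SEGMENT)]
loads = schedule(segments)
imbalance = max(loads) / (sum(loads) / len(loads))
```

Because each work item is a contiguous run of cells, accesses within an item stay sequential, while the queue keeps the per-worker load close to the average even though the tumor is concentrated in the center.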
5.3. Energy Efficiency & Resource Utilization
Memory usage under the two grid division strategies showed notable differences in performance and in the balance of the SM load. With the strategy of dividing the cell grid into 16 blocks, each block processes a complete row or column, which allows coalesced memory access, as this access pattern is aligned with the way data is stored in the global memory of the GPU. Profiling with the CUDA Toolkit indicated that shared memory usage per block was 48 KB, distributed efficiently among the concurrent threads. Global memory usage remained relatively low, as each block of threads operated sequentially on the row or column assigned to it, resulting in minimal contention for access to global memory.
The strategy of distributing the cell grid across 64 regions improved the load balance between the SMs but increased global memory access latency, mainly because the thread blocks access data following a more fragmented pattern. In this case, shared memory usage per block increased to 64 KB, and additional memory buffers were needed to manage the workload distribution. Global memory usage increased by approximately 20% compared to the 16-block distribution strategy. Each block had to access more dispersed data, which increased cache miss rates.
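The cost of a fragmented access pattern can be made concrete by counting how many cache lines a single warp touches. The sketch below assumes 4-byte cells and 128-byte cache lines; both values, and the strided pattern used as the fragmented case, are illustrative assumptions.

```python
LINE_BYTES = 128   # assumed cache line size
CELL_BYTES = 4     # assumed size of one cell state
WARP_SIZE = 32     # threads per warp
GRID_WIDTH = 1024  # cells per row (assumed)

def lines_touched(cell_indices):
    """Number of distinct cache lines covered by a set of cell indices."""
    return len({(i * CELL_BYTES) // LINE_BYTES for i in cell_indices})

base = 4 * GRID_WIDTH   # first cell of row 4
# Contiguous: 32 neighboring cells of the same row.
contiguous = [base + lane for lane in range(WARP_SIZE)]
# Fragmented: one cell from each of 32 different rows (stride of one row).
strided = [base + lane * GRID_WIDTH for lane in range(WARP_SIZE)]

contiguous_lines = lines_touched(contiguous)
strided_lines = lines_touched(strided)
```

Under these assumptions the contiguous warp is served by a single cache line, whereas the strided warp touches 32 different lines, which is the kind of behavior that raises miss rates and global memory traffic.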
The energy consumption of the two strategies, 16 rows and columns versus 64 regions of grid cells allocated to the GPU's SMs, also reveals notable differences in efficiency. Profiling with the CUDA Toolkit indicated that the SMs were not always busy with the first strategy, resulting in lower average energy consumption per execution cycle. Furthermore, this strategy benefits from reduced memory bandwidth usage, which is crucial for energy-efficient GPU execution.
On the other hand, the 64-region distribution strategy results in higher energy consumption, because it increases SM activity throughout the computation and because the blocks use more fragmented memory access patterns. The measurements show that all the SMs were active for many more cycles and that the lack of contiguous memory access caused higher energy costs per memory transaction, making this strategy less energy-efficient than the 16-block strategy.
We conclude that the strategy based on the distribution of 16 blocks was the most energy-efficient option, as it minimized the overhead due to data transfers in memory and kept the number of execution cycles low. The distribution strategy based on 64 regions provided a much better computational load balance, although it increased energy consumption. Consequently, future optimizations could focus on hybrid memory allocation strategies to mitigate energy inefficiencies while maintaining computational performance.