Submitted:
25 July 2025
Posted:
30 July 2025
Abstract
Keywords:
1. Introduction
- Benchmark and Problem Formulation: We introduce a curated dataset of 17 concurrency bugs and frame the failure-triggering task as a regression problem with sparse positives and stochastic labels—posing distinct challenges for conventional learners.
- Evaluation of Amplification Techniques: We systematically compare several model-guided search strategies and show that ensemble-based learning significantly improves bug-triggering probability within practical testing budgets.
- Practical, Black-box Testing Framework: Our approach treats the system under test as a black box, requiring no code changes or instrumentation, making it readily applicable in real-world testing workflows.
2. State of the Art in Bug Reproduction
3. Types of Concurrency Bugs
- Deadlock: A system state in which two or more threads are indefinitely blocked, each waiting for a resource that will never become available, e.g., because it is held by another. The system halts and cannot make further progress.
- Unexpected Data: Shared variables take on incorrect or inconsistent values due to unsynchronized access, race conditions, or improper interleaving of reads and writes.
- Concurrent Access: Multiple threads enter a critical section simultaneously, violating mutual exclusion and potentially corrupting shared state or breaking invariants.
- Missing or Weak Guarding: Inadequate protection of critical sections, often due to absent atomicity checks, incorrect condition synchronization, or overreliance on scheduling assumptions.
- Non-Atomic Operations on Shared State: Access to shared data is implemented via sequences of non-atomic operations, allowing interleaving by other threads to interfere with correctness.
- Incorrect Command Ordering: Synchronization operations are issued in the wrong order, violating required temporal constraints. For example, a thread signals a condition before another begins waiting for it.
- Misuse of Concurrency Primitives: Synchronization constructs such as locks, semaphores, and condition variables are used incorrectly, e.g., in unintended contexts or in ways that violate their semantics.
4. Summary of the Benchmark Problems
- Atomicity Bypass:
- A thread releases a lock before completing a read-modify-write, leading to data corruption despite apparent locking. See Section 9.1.
- Broken Barrier:
- Improper barrier reuse or reset causes some threads to wait forever, expecting others to arrive. See Section 9.2.
- Broken Peterson:
- Incorrect implementation of Peterson’s algorithm allows both threads to enter the critical section. See Section 9.3.
- Delayed Write:
- Operations are reordered due to compiler or logic flaws, leading to stale reads or broken invariants. See Section 9.4.
- Flagged Deadlock:
- Threads use flags and spin loops incorrectly, creating interleaving paths that deadlock. See Section 9.5.
- If-Not-While:
- A thread waits using an if condition instead of a while loop, leading to missed signals and unsafe access. See Section 9.6.
- Lock Order Inversion:
- Classic deadlock: threads acquire two locks in opposite order, causing circular wait. See Section 9.7.
- Lost Signal:
- A thread sends a signal before another begins waiting on a condition variable; the signal is lost, causing a deadlock. See Section 9.8.
- Partial Lock:
- Only part of the critical section is protected by a lock; race conditions still occur. See Section 9.9.
- Phantom Permit:
- A semaphore is released without a corresponding Wait, allowing more threads than expected to enter the critical section. See Section 9.10.
- Race-To-Wait:
- Threads race to increment a shared counter and both wait on a condition that never becomes true due to non-atomic updates. See Section 9.11.
- Shared Flag:
- A single boolean flag is used for synchronization without proper mutual exclusion, allowing concurrent access. See Section 9.15.
- Signal-Then-Wait:
- A thread signals with notify_all() before the other enters the wait; the notification is missed despite a guarded while loop. See Section 9.16.
- Sleeping Guard:
- A thread goes to sleep on a condition variable without checking the actual shared state, causing missed wakeups and deadlock. See Section 9.17.
5. Interleaving Multithreaded Code
| Listing 1. Core simulation loop controlling the execution of multiple threads. Each thread yields a delay, and the scheduler selects the next thread to execute based on wake-up times. |
| Listing 2. A thread modeled as a generator. Yields represent delays between atomic steps. The delays depend on a system-wide coefficient C and problem-specific parameters D_i, with optional noise added. |
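The scheduling idea in the two listings can be sketched as a small discrete-event loop. This is a minimal illustration, not the original code: the names `worker` and `simulate`, the priority queue keyed by wake-up time, and the uniform noise model are all assumptions.

```python
import heapq
import random


def worker(name, C, delays, log, noise=0.0, rng=None):
    """A thread modeled as a generator: each yield is the delay before its next atomic step."""
    rng = rng or random.Random(0)
    for step, d in enumerate(delays):
        log.append((name, step))                 # the atomic step itself
        yield C * d + rng.uniform(0.0, noise)    # delay: coefficient C times parameter d, plus noise


def simulate(threads):
    """Repeatedly run the thread with the earliest wake-up time; return the final clock value."""
    clock = 0.0
    # Entries are (wake-up time, thread index, generator); the unique index breaks ties.
    queue = [(0.0, i, t) for i, t in enumerate(threads)]
    heapq.heapify(queue)
    while queue:
        clock, i, thread = heapq.heappop(queue)
        try:
            delay = next(thread)                 # execute one atomic step
        except StopIteration:
            continue                             # thread finished; do not reschedule
        heapq.heappush(queue, (clock + delay, i, thread))
    return clock
```

Changing `C` or the per-thread delays changes which thread wakes first, and therefore which interleaving is executed; this is the kind of input a search method can tune.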
5.1. Evaluation Protocol
6. Bug-Amplification Methods
6.1. Baseline: Random Search
6.2. Simulated Annealing
| Listing 3. An implementation of Neighbourhoods Sampling. |
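One plausible form of neighbourhood sampling for simulated annealing is sketched below. It assumes the test parameters form a bounded real-valued vector; the function name `sample_neighbour`, the Gaussian step, and the `radius` parameter are hypothetical and may differ from the original listing.

```python
import random


def sample_neighbour(x, bounds, radius=0.1, rng=random):
    """Return a perturbed copy of x, with each coordinate clipped to its (low, high) bounds."""
    neighbour = []
    for xi, (low, high) in zip(x, bounds):
        step = rng.gauss(0.0, radius * (high - low))   # step size scaled to the parameter range
        neighbour.append(min(high, max(low, xi + step)))
    return neighbour
```

A small `radius` keeps the search local around a promising configuration, while a larger one lets it escape regions where the bug is never triggered.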
6.3. Genetic Algorithm-Based Search
- Crossover: A two-point crossover (VectorKPointsCrossover) with a probability of 0.5 exchanges two genome segments between parent individuals. This promotes the recombination of useful substructures and accelerates convergence.
- Mutation: Uniform N-point mutation (FloatVectorUniformNPointMutation) is applied to 10 randomly selected vector components with a probability of 0.15. This introduces variation and helps the population explore new regions in the search space.
- Selection: We use tournament selection with a size of four, where the fittest individual among four randomly sampled candidates is chosen as a parent. This balances selective pressure and population diversity.
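The three operators above can be sketched in plain Python as follows. This is an illustration only and does not call the EC-KitY classes named in the text; reading the mutation probability as applying once per individual is an assumption, as are the function names.

```python
import random


def tournament(population, fitness, k=4, rng=random):
    """Select the fittest of k individuals sampled at random."""
    return max(rng.sample(population, k), key=fitness)


def two_point_crossover(a, b, p=0.5, rng=random):
    """With probability p, swap the segment between two random cut points."""
    if rng.random() >= p:
        return a[:], b[:]
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def uniform_n_point_mutation(x, bounds, n=10, p=0.15, rng=random):
    """With probability p, redraw n randomly chosen components uniformly within their bounds."""
    y = x[:]
    if rng.random() < p:
        for idx in rng.sample(range(len(y)), min(n, len(y))):
            low, high = bounds[idx]
            y[idx] = rng.uniform(low, high)
    return y
```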
6.4. Classification-Based Method: Ensemble Stacking Classifier
Stacking Architecture
| Listing 4. The Ensemble Stacking Classifier. After experimentation, we found that stacking the four most common classifiers and combining their predictions using logistic regression gives the best results. We configured passthrough=True to allow raw features to reach the meta-learner and cv=5 for robust out-of-fold training. We also adjusted the number of iterations to cope with model complexity and used class_weight='balanced' due to skewed data, as bugs are rarely triggered. A two-layer neural network with adaptive learning rate further enhances abstraction and generalization. |
7. Results
7.1. Overall Success Rates per Problem
7.2. Per-Problem Bug-Detection Rates
7.3. Top-k Case Effectiveness
7.4. Pairwise Statistical Significance Analysis
7.5. Convergence Analysis Across Methods
7.6. Summary of Key Findings
- Learning-based amplification significantly outperforms uninformed approaches.
- The ensemble classifier (Ens) consistently achieved the highest bug-triggering probabilities across nearly all test-case budgets and problems. With just 500 test-cases, Ens reached average success probabilities exceeding 0.53, whereas BF, SA, and GA remained below 0.13. At the full budget of 3900 test-cases, Ens achieved near-perfect detection (over 0.9 probability) in more than half of the problems, including LockOrderInversion, SignalThenWait, and IfNotWhile.
- Ens converges faster and with fewer test-cases.
- While BF, SA, and GA showed gradual or erratic improvements, Ens rapidly identified failure-inducing cases. For example, in RacyIncrement, Ens surpassed 0.9 success probability with fewer than 1100 test-cases, while GA plateaued at 0.07 and BF at 0.03 even after 3900 cases. This sample efficiency makes Ens especially valuable for real-world systems with costly or time-limited testing resources.
- Traditional search methods offer limited scalability.
- BF showed minimal improvement over increasing test budgets, with average performance rarely exceeding 0.15 across problems. SA’s performance improved modestly but remained inconsistent, failing to trigger bugs in several hard problems like SharedFlag and SemaphoreLeak. GA was more effective than BF and SA in moderately complex problems but still lagged behind Ens in both speed and final success rates.
- Problem hardness varies significantly and affects method effectiveness.
- Some problems were consistently easy (e.g., SignalThenWait and LockOrderInversion) and triggered by all methods to varying degrees. Others, such as SharedFlag, SemaphoreLeak, and BrokenBarrier, remained elusive, with only Ens achieving meaningful success (e.g., 0.49 in SemaphoreLeak vs. <0.03 for others). This suggests that learning-based methods are better suited for navigating complex or deceptive search spaces.
- Ens robustness is evident across all tested budgets.
- The bird’s-eye view (Figure 2) shows that across all 17 problems and at every tested budget (500, 1100, 2100, and 3900), Ens consistently led or tied for the highest success rate. Notably, in 13 out of 17 problems, Ens reached probabilities above 0.85 with 3900 test-cases, while GA exceeded 0.5 in only seven, SA in two, and BF in one.
- Integration of feedback powers Ens performance.
- Unlike the other methods, which rely on sampling or mutation heuristics, Ens uses supervised learning to predict and prioritize high-risk inputs. This allows it to generalize from early failures, focusing search efforts efficiently. The result is not only higher probabilities of detecting bugs but also significantly fewer wasted executions.
- Ablation Study.
- We conducted ablation studies by removing components from the ensemble classifier and modifying its sampling heuristics. Specifically, we evaluated simplified variants of our pipeline, such as omitting SMOTE or disabling passthrough to the meta-learner. These reduced versions consistently underperformed relative to the full classifier configuration we present in the paper. In several cases, the simplified ensemble-based methods even performed worse than the brute-force baseline, highlighting the importance of each pipeline component in achieving effective bug amplification.
8. Related Work
8.1. Concurrency Bug Debugging Methods
8.2. Concurrency Bug Datasets
9. Detailed Description of the Benchmark Problems
9.1. Atomicity Bypass: Unexpected Data from Lock Misuse
- Description:
- Simulates two threads updating a shared counter under the false assumption that a critical section is properly protected. Each thread acquires a mutex, reads the counter, but then mistakenly releases the mutex before performing the update. As a result, both threads read the same value (e.g., 0), and both write back 1, overwriting each other’s increment. The final result is data corruption: the counter appears to have only been incremented once.
- Effect:
- A clearly unexpected data outcome, where both threads read the same initial value of the counter and write back identical updates, resulting in a lost increment. This leads to data corruption, as the counter reflects only one update instead of two, violating correctness expectations.
- Root Cause:
- Misuse of concurrency primitives: The locking discipline was violated by releasing the mutex too early.
- Insight:
- This demonstrates that simply using synchronization tools is insufficient; they must be used correctly and consistently to protect shared operations.
- Pseudo Code:
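A minimal Python sketch of this pattern, in the generator style of Section 5 where each `yield` marks a possible context switch. It is an illustration, not the benchmark's own listing; `worker`, `run`, and the explicit schedule are hypothetical.

```python
import threading


def worker(state, mutex):
    with mutex:
        local = state["counter"]     # read under the lock
    # BUG: the mutex is released before the write-back
    yield                            # context switch point
    state["counter"] = local + 1     # unprotected write of a possibly stale value


def run(schedule):
    """Step two workers in the given order and return the final counter."""
    state, mutex = {"counter": 0}, threading.Lock()
    threads = [worker(state, mutex), worker(state, mutex)]
    for tid in schedule:
        next(threads[tid], None)
    return state["counter"]
```

Under the schedule `[0, 1, 0, 1]` both workers read 0 before either writes, so one increment is lost; under `[0, 0, 1, 1]` the counter reaches 2.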

9.2. Broken Barrier: Deadlock from Barrier Misuse with Incorrect Participant Count
- Description:
- Three threads increment a shared variable and call SignalAndWait() on a barrier that is configured for only two participants. One thread calls SignalAndWait() twice before resetting the barrier, violating the expected usage pattern.
- Effect:
- This misconfiguration can lead to deadlock, as some threads may wait indefinitely for signals that never arrive. It may also cause assertion failures if the synchronization logic assumes a specific number of participants.
- Root Cause:
- A misuse of primitives, where the barrier is used in a way that contradicts its intended configuration.
- Insight:
- This problem illustrates the importance of synchronization primitives being correctly configured for the actual number of participating threads. Misuse of barriers can lead to subtle and difficult-to-diagnose concurrency failures.
- Pseudo Code:
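The participant-count mismatch can be sketched with Python's `threading.Barrier`. This is a simplified illustration rather than the benchmark code: it omits the double `SignalAndWait()` call, the function name `run_parties` is hypothetical, and a timeout stands in for waiting forever.

```python
import threading


def run_parties(n_threads=3, parties=2, timeout=0.2):
    """Let n_threads arrive at a barrier sized for `parties`; report each thread's outcome."""
    barrier = threading.Barrier(parties)     # BUG: sized for fewer threads than actually arrive
    outcome = []

    def party():
        try:
            barrier.wait(timeout=timeout)    # the timeout models an indefinite wait
            outcome.append("passed")
        except threading.BrokenBarrierError:
            outcome.append("stuck")

    threads = [threading.Thread(target=party) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(outcome)
```

With three threads and two parties, two threads pass together and the third has no partner to release it.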

9.3. Broken Peterson: Mutual Exclusion Violation in Generalized Peterson’s Algorithm
- Description:
- This problem involves a generalized version of Peterson’s algorithm for four processes. The implementation uses arrays to track process levels and a last_to_enter array to manage entry ordering. However, a critical assignment to last_to_enter[level] is omitted, breaking the algorithm’s tie-breaking logic.
- Effect:
- Multiple processes may enter the critical section concurrently, resulting in concurrent access.
- Root Cause:
- A missing or weak guard in the synchronization protocol, specifically, a missing update in the entry coordination mechanism.
- Insight:
- This example highlights how even small implementation errors in well-established algorithms can undermine their correctness. It underscores the need for rigorous validation of synchronization logic, especially in generalized or modified versions of classic algorithms.
- Pseudo Code:

9.4. Delayed Write: Assertion Failure from Non-Atomic Test-and-Set Simulation
- Description:
- A simulation models a test-and-set operation where one thread sets a shared variable x to a target value. However, another thread may interleave and modify x during a context switch, violating the assumption that x remains unchanged after being set.
- Effect:
- Concurrent access and unexpected data, often manifesting as an assertion failure when the invariant x == target is violated.
- Root Cause:
- An incorrect command ordering stemming from the test-and-set logic. The thread reads and later writes to x, but a context switch between these steps allows another thread to intervene and modify the variable, violating expected execution order.
- Insight:
- This case illustrates how concurrency bugs can emerge even in simulated atomic operations if the underlying memory operations are not properly synchronized. It emphasizes the importance of true atomicity in synchronization primitives.
- Pseudo Code:

9.5. Flagged Deadlock: Deadlock Risk from Complex Locking
- Description:
- Involves two threads using a combination of locking strategies, including recursive locks, try-locks, and conditional logic based on shared flags. The complexity of the locking protocol introduces multiple paths for acquiring locks, some of which may conflict or fail to release locks properly.
- Effect:
- A heightened risk of deadlock, as threads may become stuck waiting for locks that are never released or acquired in inconsistent orders.
- Root Cause:
- A combination of misuse of primitives and non-cooperative scheduling, exacerbated by the use of active waiting (spin locks) instead of blocking synchronization.
- Insight:
- This case highlights the dangers of over-engineering synchronization logic. Complex locking schemes, especially those involving conditional paths and re-entrant locks, are prone to subtle bugs and should be avoided in favor of simpler, more predictable designs.
- Pseudo Code:

9.6. If-Not-While: Deadlock and Missed Signals from Condition Variable Misuse
- Description:
- Two consumer threads wait on a shared queue using Monitor.Wait(mutex) when the queue is empty. A producer thread enqueues data and signals all waiting consumers using Monitor.PulseAll(mutex). However, the consumers guard the wait with an if statement rather than a while loop, failing to re-check the condition upon waking.
- Effect:
- This leads to two possible effects: deadlock, if a consumer misses a signal and waits indefinitely, or unexpected data loss, if a consumer proceeds without the queue being properly populated.
- Root Cause:
- A race condition caused by a weak guard; the failure to revalidate the condition after waking allows incorrect assumptions about the system state.
- Insight:
- This problem reinforces the importance of using guarded waits with while loops when working with condition variables, ensuring that threads only proceed when the condition they depend on is truly satisfied.
- Pseudo Code:
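A minimal sketch of the faulty guard, with the wait modeled as a `yield` and `PulseAll` modeled by resuming both consumers. The names and the use of `None` to mark an access to an empty queue are illustrative assumptions, not the benchmark code.

```python
from collections import deque


def consumer(queue, got):
    if not queue:          # BUG: `if` instead of `while`
        yield "waiting"    # models Monitor.Wait(mutex)
    # After waking up, the condition is not re-checked.
    got.append(queue.popleft() if queue else None)   # None marks a dequeue from an empty queue


def demo():
    queue, got = deque(), []
    consumers = [consumer(queue, got), consumer(queue, got)]
    for c in consumers:
        next(c)                # both find the queue empty and wait
    queue.append("item")       # the producer enqueues a single item ...
    for c in consumers:
        next(c, None)          # ... and PulseAll wakes both consumers
    return got
```

The first consumer takes the item; the second proceeds even though the queue is empty again, which a `while` guard would have prevented.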

9.7. Lock Order Inversion: Deadlock from Inconsistent Lock Acquisition Order
- Description:
- In this classic concurrency scenario, two threads attempt to acquire two shared locks but do so in opposite orders. Thread 0 first locks mutex1 and then attempts to acquire mutex2, while Thread 1 begins by locking mutex2 and then proceeds to request mutex1. This inversion in lock acquisition order creates a circular wait condition: each thread holds one lock and waits indefinitely for the other to release the second, which never happens.
- Effect:
- A deadlock, where both threads are permanently blocked, unable to make progress.
- Root Cause:
- Incorrect command ordering: a well-known concurrency design flaw in which multiple threads acquire shared resources in inconsistent sequences. Such errors can easily lead to circular dependencies, especially in systems that lack a global lock acquisition policy.
- Insight:
- This problem exemplifies the dangers of uncoordinated locking strategies in multithreaded environments. It highlights the importance of enforcing a consistent global order for acquiring multiple locks, a practice that can prevent deadlocks and ensure system liveness. The scenario is a textbook case of “lock inversion”, a term often used to describe such deadlock-prone patterns in concurrent programming.
- Pseudo Code:
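The circular wait can be reproduced deterministically with generator-based threads and a lock table. This is a sketch under assumed names (`worker`, `run_round_robin`, the `owner` dictionary); blocking is modeled as spinning with a `yield`.

```python
def worker(tid, first, second, owner):
    while owner.get(first) is not None:      # block until the first lock is free
        yield "blocked"
    owner[first] = tid
    yield "holding-one"                      # context switch between the two acquisitions
    while owner.get(second) is not None:     # block until the second lock is free
        yield "blocked"
    owner[second] = tid
    yield "critical"
    owner[first] = owner[second] = None      # release both locks


def run_round_robin(orders, max_steps=20):
    """Alternate between threads; return each thread's last observed status."""
    owner = {}
    threads = [worker(tid, a, b, owner) for tid, (a, b) in enumerate(orders)]
    status = [None] * len(threads)
    for step in range(max_steps):
        tid = step % len(threads)
        status[tid] = next(threads[tid], "done")
    return status
```

With opposite orders each thread holds one lock and blocks on the other; with a consistent global order both finish.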

9.8. Lost Signal: Deadlock from Missed Signal in Condition Variable Coordination
- Description:
- Two threads coordinate using a shared condition variable. Thread 0 waits for a flag to become true using an if statement and then calls wait(). Thread 1 sets the flag and sends a notification using notify_all(). If Thread 1 sends the signal before Thread 0 begins waiting, the signal is lost, and Thread 0 waits indefinitely.
- Effect:
- A deadlock, as Thread 0 never receives the signal it depends on.
- Root Cause:
- A weak guard: Thread 0 fails to re-check the condition after waking and uses an if statement instead of a while loop to guard the wait.
- Insight:
- This problem reinforces a key principle in concurrent programming: condition variables must be used with guarded waits that revalidate the condition upon waking. This ensures correctness even in the presence of spurious wakeups or early notifications.
- Pseudo Code:
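A minimal sketch of the lost notification, assuming (as in the simulation model) that checking the flag and entering the wait are separate atomic steps. The `parked` field and the function names are hypothetical stand-ins for a condition variable's wait set.

```python
def waiter(state):
    if not state["flag"]:           # BUG: single `if` check, separate from the wait itself
        yield "checked"             # context switch between check and wait
        state["parked"] = True      # wait(): park until a later notify_all()
        while state["parked"]:
            yield "waiting"
    yield "proceeded"


def signaller(state):
    state["flag"] = True
    state["parked"] = False         # notify_all(): wakes only a thread that is already parked
    yield "signalled"


def run(order):
    """Step waiter ('w') and signaller ('s') in the given order; return the waiter's last status."""
    state = {"flag": False, "parked": False}
    threads = {"w": waiter(state), "s": signaller(state)}
    last = {}
    for name in order:
        last[name] = next(threads[name], "done")
    return last["w"], state["flag"]
```

If the signaller runs between the waiter's check and its wait, the waiter stays parked although the flag is already true.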

9.9. Partial Lock: Race Condition from Insufficient Lock Coverage
- Description:
- Two threads manipulate a shared variable i under a locking mechanism. Thread 0 increments i by 2 and checks whether i == 5, while Thread 1 decrements i by 1. Although both threads use a lock, the locking does not encompass all relevant operations or ensure proper coordination between them. As a result, the interleaving of operations can lead to unexpected values of i, potentially triggering assertion failures.
- Effect:
- Unexpected data or incorrect computation, as the shared state evolves in ways not anticipated by the program logic.
- Root Cause:
- A missing or weak guard: the lock is not applied consistently across all accesses and updates to the shared variable, allowing unsafe interleaving.
- Insight:
- This example illustrates that merely using locks is not enough; they must be applied comprehensively and consistently to protect all shared state interactions.
- Pseudo Code:

9.10. Phantom Permit: Mutual Exclusion Violation from Semaphore Misuse
- Description:
- Two threads share a binary semaphore intended to serialise entry to a critical section. Thread 0 performs the canonical Wait–critical section–Release sequence, preserving mutual exclusion. Thread 1, by contrast, invokes Wait(timeout). If the timeout expires, it nevertheless executes Release, effectively inserting an extra permit into the semaphore (a “phantom” permit).
- Effect:
- Concurrent access arises when the phantom permit allows both threads to enter the critical section simultaneously, enabling interleaved operations that can corrupt shared state or violate higher-level invariants.
- Root Cause:
- The defect is rooted in a misuse of concurrency primitives: issuing Release without first holding the semaphore breaks the required one-to-one pairing of Wait/Release. This increases the semaphore’s count spuriously and defeats its mutual-exclusion guarantee.
- Insight:
- Correct semaphore protocols demand that every Release correspond to a successful Wait. Introducing time-limited waits without compensating logic must be done carefully; otherwise, phantom permits can emerge and silently undermine critical-section protection.
- Pseudo Code:
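The phantom permit can be shown with Python's `threading.Semaphore`, replaying the problematic interleaving in a single thread. The function name `demo` and the 10 ms timeout are illustrative choices, not part of the benchmark.

```python
import threading


def demo():
    sem = threading.Semaphore(1)                  # binary semaphore guarding the critical section
    sem.acquire()                                 # Thread 0: Wait succeeds, enters the critical section
    acquired = sem.acquire(timeout=0.01)          # Thread 1: Wait(timeout) expires and returns False
    sem.release()                                 # BUG: Release although the Wait failed
    second_entry = sem.acquire(blocking=False)    # a later Wait succeeds while Thread 0 is still inside
    return acquired, second_entry
```

A repaired version would call `release()` only when `acquired` is true.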

9.11. Race-To-Wait: Deadlock from Non-Atomic Coordination
- Description:
- Two threads attempt to synchronize based on a shared counter waiters. Each thread increments the counter and then waits for it to reach a specific value (e.g., 2) before proceeding. However, the increment and check operations on waiters are not atomic. Both threads may read the value 1 simultaneously before either has incremented it again, leading them both to wait forever for the counter to reach 2, which never happens.
- Effect:
- A classic deadlock, even though no explicit locking mechanism is involved.
- Root Cause:
- A non-atomic operation on shared state: the threads make decisions based on stale or incomplete views of shared memory.
- Insight:
- This example highlights how even minimalistic, lock-free coordination can result in liveness failures if atomicity is not respected.
- Pseudo Code:

9.12. Racy Increment: Race Condition from Non-Atomic Compound Operations
- Description:
- This problem illustrates a subtle but critical flaw in assuming that compound operations are atomic. Two threads execute the expression a = a + 1; if (a == 1) enter critical section, intending to allow only the first thread that increments a to 1 to enter the critical section. However, this logic fails under concurrent execution because the operation a = a + 1 is not atomic: it decomposes into a sequence of read, increment, and write steps. If both threads interleave during these steps, they may each observe a as 0, increment it to 1, and both proceed into the critical section.
- Effect:
- Concurrent access, where both threads enter a region that was intended to be accessed by only one. This leads to unexpected data, as the shared state is manipulated under the false assumption of exclusivity.
- Root Cause:
- A non-atomic operation on shared state: the increment-and-check sequence is not atomic. Without synchronization, the interleaving of operations allows both threads to satisfy the condition a == 1 simultaneously.
- Insight:
- This example underscores the importance of using atomic operations or explicit synchronization mechanisms, such as locks or atomic primitives, when accessing shared variables. It also highlights how deceptively simple code can harbor concurrency bugs if the underlying memory operations are not properly understood.
- Pseudo Code:
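A minimal sketch that splits `a = a + 1` into its read and write steps, with a `yield` at each possible context switch. The names `worker` and `run` and the explicit schedules are hypothetical.

```python
def worker(state, entered, tid):
    tmp = state["a"]            # read
    yield
    state["a"] = tmp + 1        # write back the (possibly stale) incremented value
    yield
    if state["a"] == 1:         # check
        entered.append(tid)     # enter the critical section


def run(schedule):
    """Step two workers in the given order; return the ids that entered the critical section."""
    state, entered = {"a": 0}, []
    threads = [worker(state, entered, 0), worker(state, entered, 1)]
    for tid in schedule:
        next(threads[tid], None)
    return entered
```

When the reads interleave, both threads write 1 and both pass the check; when the threads run one after the other, only the first enters.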

9.13. Semaphore Leak: Mutual Exclusion Violation from Semaphore Misuse
- Description:
- Involves two threads using a semaphore to control access to a critical section. Thread 0 follows the standard Wait–critical section–Release pattern. Thread 1, however, performs a time-limited Wait and calls Release regardless of whether it successfully acquired the semaphore.
- Effect:
- This behavior can corrupt the semaphore’s internal count, allowing multiple threads to enter the critical section simultaneously, a clear case of concurrent access.
- Root Cause:
- A misuse of concurrency primitives: releasing a semaphore without a corresponding acquisition violates the expected one-to-one pairing of Wait and Release.
- Insight:
- This example underscores the importance of maintaining strict discipline when using semaphores. Any deviation from the expected protocol can compromise the integrity of the synchronization mechanism.
- Pseudo Code:

9.14. Shared Counter: Mutual Exclusion Violation from Unsynchronized Counter
- Description:
- Involves two threads incrementing a shared counter and entering a critical section based on different thresholds, one at a count of 5, the other at 3. The counter is not protected by any synchronization mechanism, allowing updates to interleave unpredictably.
- Effect:
- Both threads may enter the critical section simultaneously or at unintended times, leading to concurrent access and unexpected data.
- Root Cause:
- A race condition due to the non-atomic update and check of the shared counter.
- Insight:
- This example demonstrates the necessity of synchronizing access to shared counters, especially when control flow decisions depend on their values. Without atomicity, even simple arithmetic can lead to concurrency failures.
- Pseudo Code:

9.15. Shared Flag: Mutual Exclusion Violation from Weak Boolean Flag Guard
- Description:
- Demonstrates the inadequacy of using a simple Boolean flag to enforce mutual exclusion. Two threads share a flag and use it to guard a critical section. Each thread spins while the flag is true, sets it to true, enters the critical section, and then resets it to false. However, the check (flag != false) and the update (flag = true) are not atomic. If one thread is preempted after checking the flag but before setting it, the other thread may also pass the check and set the flag, resulting in both threads entering the critical section concurrently.
- Effect:
- Concurrent access, where the critical section is accessed simultaneously by multiple threads, leading to potential data corruption or logic errors.
- Root Cause:
- A weak guard: the synchronization mechanism fails to ensure atomicity between the check and the update.
- Insight:
- This highlights the need for atomic test-and-set operations or proper locking mechanisms to enforce exclusive access.
- Pseudo Code:
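A minimal sketch of the non-atomic check-then-set, with a `yield` marking the preemption point between the two. The occupancy counter `max_inside` is an illustrative device for observing the violation and is not part of the benchmark.

```python
def worker(state):
    while state["flag"]:            # check: spin while another thread holds the flag
        yield
    yield                           # preemption between the check and the set
    state["flag"] = True            # set (not atomic with the check above)
    state["inside"] += 1
    state["max_inside"] = max(state["max_inside"], state["inside"])
    yield                           # critical section
    state["inside"] -= 1
    state["flag"] = False


def run(schedule):
    """Step two workers in the given order; return the peak number of threads in the critical section."""
    state = {"flag": False, "inside": 0, "max_inside": 0}
    threads = [worker(state), worker(state)]
    for tid in schedule:
        next(threads[tid], None)
    return state["max_inside"]
```

If both threads pass the check before either sets the flag, the peak occupancy is 2; run back to back, it stays at 1.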

9.16. Signal-Then-Wait: Deadlock from Premature Signaling in Condition Synchronization
- Description:
- Two threads coordinate using a shared flag and condition variable. The signaling thread sets the flag and calls notify_all() before the waiting thread has entered the blocking wait. Although the waiting thread uses a correct while guard around the condition variable, the notification is missed entirely because the thread was not waiting yet.
- Effect:
- A clear deadlock: the waiting thread blocks indefinitely, even though the condition it depends on was fulfilled. This occurs because condition variable signals do not persist: if a signal is sent before a thread is waiting, it is lost.
- Root Cause:
- An incorrect ordering of commands: the signal is issued before the synchronization context is established. This leads to a fundamental timing mismatch between threads.
- Insight:
- This pattern highlights that the timing of signal delivery in condition variable synchronization is critical. Signals must occur only after the corresponding wait condition has been armed, or the system risks falling into liveness failures such as deadlock.
- Pseudo Code:

9.17. Sleeping Guard: Deadlock from Missing or Weak Guard
- Description:
- Presents a subtle but powerful failure in the use of condition synchronization. A consumer thread checks a queue, and if it’s empty, sets a waiting flag and waits. A producer thread checks for the flag and enqueues data. The issue occurs if the producer enqueues a new item before the consumer sets the flag; the consumer misses the notification and remains blocked indefinitely.
- Effect:
- A classic deadlock, in which the consumer thread remains permanently blocked waiting for a signal that was sent before it armed the condition, while the producer continues indefinitely, leaving the system with no forward progress.
- Root Cause:
- A missing or weak guard: the consumer waits based solely on a flag without rechecking the real shared resource (the queue). In such designs, the wait must be governed by a guard that accurately reflects the synchronization invariant, and it must be re-evaluated after any wake-up event.
- Insight:
- Without such a recheck, typically enforced with a while loop, the thread risks sleeping forever, even though the condition it depends on has already been satisfied.
- Pseudo Code:

10. Limitations of the Proposed Approach
11. Summary and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| BF | Brute-Force |
| BGU | Ben-Gurion University of the Negev |
| CI | Continuous Integration |
| CLT | Central Limit Theorem |
| Ens | Ensemble |
| GA | Genetic Algorithm |
| GP | Genetic Programming |
| LLN | Law of Large Numbers |
| HPC | High-Performance Computing |
| MLP | Multi-Layer Perceptron |
| PCT | Probabilistic Concurrency Testing |
| SA | Simulated Annealing |
| SD | Standard Deviation |
| SLURM | Simple Linux Utility for Resource Management |
| SMOTE | Synthetic Minority Over-sampling Technique |
| SUT | System Under Test |
References
- Gray, J. Why Do Computers Stop and What Can Be Done About It? Technical Report 85.7, Tandem Computers, Palo Alto, CA, 1985. Accessed on. 16 July.
- Bakhshi, R.; Kunche, S.; Pecht, M. Intermittent Failures in Hardware and Software. Journal of Electronic Packaging 2014, 136, 011014. [Google Scholar] [CrossRef]
- Heidelberger, P. Fast simulation of rare events in queueing and reliability models. ACM Trans. Model. Comput. Simul. 1995, 5, 43–85. [Google Scholar] [CrossRef]
- Younes, H.L.; Simmons, R.G. Statistical probabilistic model checking with a focus on time-bounded properties. Information and Computation 2006, 204, 1368–1409. [Google Scholar] [CrossRef]
- Kumar, R.; Lee, J.; Padhye, R. Fray: An Efficient General-Purpose Concurrency Testing Platform for JVM. arXiv, 2501. [Google Scholar] [CrossRef]
- Burckhardt, S.; Kothari, P.; Musuvathi, M.; Nagarakatte, S. A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’10), Pittsburgh, PA, USA; 2010; pp. 167–178. [Google Scholar] [CrossRef]
- Zhao, H.; Wolff, D.; Mathur, U.; Roychoudhury, A. Selectively Uniform Concurrency Testing. Proceedings of the ACM on Programming Languages (ASPLOS) 2025, 5. [Google Scholar] [CrossRef]
- Ramesh, A.; Huang, T.; Riar, J.; Titzer, B.L.; Rowe, A. Unveiling Heisenbugs with Diversified Execution. ACM on Programming Languages 2025, 9, 393–420. [Google Scholar] [CrossRef]
- Godefroid, P.; Levin, M.Y.; Molnar, D.A. Effective Testing for Concurrency Bugs. Tech. rep. mpi–sws–2015–004, MPI–SWS, 2015. Accessed on 16 July 2025.
- Han, T.; Gong, X.; Liu, J. CARDSHARK: Understanding and Stabilizing Linux Kernel Concurrency Bugs Against the Odds. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA; 2024; pp. 1867–1884. [Google Scholar]
- Bianchi, F.A.; Pezzè, M.; Terragni, V. A Search-Based Approach to Reproduce Crashes in Concurrent Programs. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), Paderborn, Germany; 2017; pp. 221–232. [Google Scholar] [CrossRef]
- Rasheed, S.; Dietrich, J.; Tahir, A. On the Effect of Instrumentation on Test Flakiness. In Proceedings of the 2023 IEEE/ACM International Conference on Automation of Software Test (AST), San Francisco, CA, USA; 2023; pp. 329–341. [Google Scholar] [CrossRef]
- Xu, J.; Wolff, D.; Han, X.; Li, J.; Roychoudhury, A. Concurrency Testing in the Linux Kernel via eBPF. arXiv, 2504. [Google Scholar] [CrossRef]
- Musuvathi, M.; Qadeer, S.; Ball, T.; Basler, G.; Nainar, P.A.; Neamtiu, I. Finding and Reproducing Heisenbugs in Concurrent Programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2008), San Diego, CA, USA; 2008; pp. 267–280. [Google Scholar]
- Shashank, S.S.; Sachdeva, J.; Mukherjee, S.; Deligiannis, P. Nekara: A Generalized Concurrency Testing Library. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia; 2021; pp. 634–646. [Google Scholar] [CrossRef]
- Lee, S.; Zhang, H.; Viswanathan, M. Probabilistic Concurrency Testing for Weak Memory Programs. In Proceedings of the 28th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Montreal, QC, Canada; 2023; pp. 133–147. [Google Scholar] [CrossRef]
- Elmas, T.; Burnim, J.; Necula, G.C.; Sen, K. CONCURRIT: A Domain Specific Language for Reproducing Concurrency Bugs. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13), Seattle, WA, USA; 2013; pp. 441–452. [Google Scholar] [CrossRef]
- Chen, Y.; Liu, S.; Gan, Q. Effective Concurrency Testing for Go via Directional Primitive Scheduling. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg; 2023; pp. 138–149. [Google Scholar] [CrossRef]
- Li, X.; Li, W.; Zhang, Y.; Zhang, L. DeepFL: Integrating Multiple Fault Diagnosis Dimensions for Deep Fault Localization. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’19). ACM; 2019; pp. 169–180. [Google Scholar] [CrossRef]
- Böttinger, K.; Godefroid, P.; Singh, R. Learn&Fuzz: Machine Learning for Input Fuzzing. arXiv 2018, abs/1701. [Google Scholar] [CrossRef]
- Amalfitano, D.; Faralli, S.; Hauck, J.C.R.; Matalonga, S.; Distante, D. Artificial Intelligence Applied to Software Testing: A Tertiary Study. ACM Computing Surveys 2023, 56, 1–29. [Google Scholar] [CrossRef]
- Leesatapornwongsa, T.; Lukman, J.F.; Lu, S.; Gunawi, H.S. TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’16), April 2016, Vol. 51, SIGPLAN Notices; pp. 517–530. [CrossRef]
- Sipper, M.; Green, B.; Ronen, Y.; Gat, T.; Hoffman, S.; Zohar, N. EC-KitY: Evolutionary computation tool kit in Python with seamless machine learning integration. SoftwareX 2023, 23, 101381. [Google Scholar] [CrossRef]
- Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison-Wesley: Reading, MA, USA, 1989. Accessed on 16 July 2025.
- Karafotias, G.; Hoogendoorn, M.; Eiben, A.E. Parameter Control in Evolutionary Algorithms: Trends and Challenges. IEEE Transactions on Evolutionary Computation 2015, 19, 167–187. [Google Scholar] [CrossRef]
- Elyasaf, A.; Farchi, E.; Margalit, O.; Weiss, G.; Weiss, Y. Generalized Coverage Criteria for Combinatorial Sequence Testing. IEEE Transactions on Software Engineering 2023, 49, 4023–4034. [Google Scholar] [CrossRef]
- Wasserstein, R.L.; Lazar, N.A. The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician 2016, 70, 129–133. [Google Scholar] [CrossRef]
- Liu, K.; Chen, Z.; Liu, Y.; Zhang, J.M.; Harman, M.; Han, Y.; Ma, Y.; Dong, Y.; Li, G.; Huang, G. LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs. arXiv preprint arXiv:2404.10304, 2024. [Google Scholar] [CrossRef]
- Ouédraogo, W.C.; Plein, L.; Kaboré, K.; Habib, A.; Klein, J.; Lo, D.; Bissyandé, T.F. Enriching Automatic Test Case Generation by Extracting Relevant Test Inputs from Bug Reports. Empirical Software Engineering 2025, 30, 1–27. [Google Scholar] [CrossRef]
- Benavoli, A.; Corani, G.; Mangili, F. Should we really use post-hoc tests based on mean-ranks? CoRR 2015, 2288. [Google Scholar] [CrossRef]
- Might, M.; Van Horn, D. A Family of Abstract Interpretations for Static Analysis of Concurrent Higher-Order Programs. In Static Analysis (SAS 2011); Yahav, E., Ed.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2011. [Google Scholar] [CrossRef]
- Bora, U.; Vaishay, S.; Joshi, S.; Upadrasta, R. OpenMP Aware MHP Analysis for Improved Static Data-Race Detection. In Proceedings of the 7th IEEE/ACM Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC ’21). IEEE/ACM; 2021; pp. 1–11. [Google Scholar] [CrossRef]
- Matsakis, N.D.; Klock, F.S., II. The Rust Language. Ada Letters 2014, 34, 103–104. [Google Scholar] [CrossRef]
- Godefroid, P.; Klarlund, N.; Sen, K. DART: Directed Automated Random Testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Association for Computing Machinery; 2005; pp. 213–223. [Google Scholar] [CrossRef]
- Tehrani, A.; Khaleel, M.; Akbari, R.; Jannesari, A. DeepRace: Finding Data Race Bugs via Deep Learning. arXiv preprint arXiv:1907.07110, 2019. [Google Scholar] [CrossRef]
- Chen, H.; Guo, S.; Xue, Y.; Sui, Y.; Zhang, C.; Li, Y.; Wang, H.; Liu, Y. MUZZ: Thread-aware Grey-box Fuzzing for Effective Bug Hunting in Multithreaded Programs. In Proceedings of the 29th USENIX Security Symposium (USENIX Security ’20), Boston, MA, USA; 2020; pp. 2325–2342. [Google Scholar]
- Roemer, J.; Genç, K.; Bond, M.D. SmartTrack: Efficient Predictive Race Detection. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’20). ACM; 2020; pp. 747–762. [Google Scholar] [CrossRef]
- O’Callahan, R.; Jones, C.; Froyd, N.; Huey, K.; Noll, A.; Partush, N. Engineering Record And Replay For Deployability. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC ’17). USENIX Association; 2017; pp. 377–390. [Google Scholar]
- Holzmann, G.J. The Model Checker SPIN. IEEE Transactions on Software Engineering 1997, 23, 279–295. [Google Scholar] [CrossRef]
- Clarke, E.M.; Biere, A.; Raimi, R.; Zhu, Y. Bounded Model Checking Using Satisfiability Solving. Formal Methods in System Design 2001, 19, 7–34. [Google Scholar] [CrossRef]
- Clarke, E.M.; Grumberg, O.; Jha, S.; Lu, Y.; Veith, H. Counterexample-Guided Abstraction Refinement. In Proceedings of the 12th International Conference on Computer Aided Verification (CAV). Springer, Vol. 1855, Lecture Notes in Computer Science; 2000; pp. 154–169. [Google Scholar] [CrossRef]
- Namjoshi, K.S.; Trefler, R.J. Parameterized Compositional Model Checking. In Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2016), Vol. 9636, Lecture Notes in Computer Science; 2016; pp. 589–606. [Google Scholar] [CrossRef]
- Legay, A.; Lukina, A.; Traonouez, L.; Yang, J.; Smolka, S.A.; Grosu, R. Statistical Model Checking. In Computing and Software Science; Springer Cham, 2019; Vol. 11506, Lecture Notes in Computer Science, pp. 478–504. [CrossRef]
- Xu, M.; Kashyap, S.; Zhao, H.; Kim, T. KRACE: Data Race Fuzzing for Kernel File Systems. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP). IEEE; 2020; pp. 1643–1660. [Google Scholar] [CrossRef]
- Lu, S.; Park, S.; Seo, E.; Zhou, Y. Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’08), Seattle, WA, USA; 2008; pp. 329–339. [Google Scholar] [CrossRef]
- Musuvathi, M.; Qadeer, S.; Ball, T.; Basler, G.; Nainar, P.A.; Neamtiu, I. Finding and Reproducing Heisenbugs in Concurrent Programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, USA; 2008; pp. 267–280. [Google Scholar]
- Tian, Y.; Yu, Y.; Wang, P.; Zhou, R.; Jin, H.; Xie, T. RACEBENCH: A Benchmark Suite for Data Race Detection Tools. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE ’11). ACM; 2011; pp. 142–151. [Google Scholar] [CrossRef]
- Zhang, W.; Yao, C.; Lu, S.; Huang, J.; Tan, T.; Liu, X. ConSeq: Detecting Concurrency Bugs Through Sequential Errors. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’11). ACM; 2011; pp. 251–264. [Google Scholar] [CrossRef]
- Lin, Z.; Marinov, D.; Zhong, H.; Chen, Y.; Zhao, J. JaConTeBe: A Benchmark Suite of Real-World Java Concurrency Bugs. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE ’15). IEEE / ACM; 2015; pp. 178–189. [Google Scholar] [CrossRef]
- Just, R.; Jalali, D.; Ernst, M.D. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA ’14), San Jose, CA, USA, July 2014; pp. 437–440. [Google Scholar] [CrossRef]
- Madeiral, F.; Urli, S.; de Almeida Maia, M.; Monperrus, M. BEARS: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’19). IEEE; 2019; pp. 468–478. [Google Scholar] [CrossRef]
- Karampatsis, R.; Sutton, C. How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR ’20). ACM; 2020; pp. 573–577. [Google Scholar] [CrossRef]
- Tu, T.; Liu, X.; Song, L.; Zhang, Y. Understanding Real-World Concurrency Bugs in Go. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). ACM; 2019; pp. 865–878. [Google Scholar] [CrossRef]
- Yuan, T.; Li, G.; Lu, J.; Liu, C.; Li, L.; Xue, J. GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs. In Proceedings of the 18th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’21). IEEE / ACM; 2021; pp. 187–199. [Google Scholar] [CrossRef]
- Torres, C.F.; Iannillo, A.K.; Gervais, A.; State, R. ConFuzzius: A Data Dependency-Aware Hybrid Fuzzer for Smart Contracts. In Proceedings of the 2021 IEEE European Symposium on Security and Privacy (EuroS&P ’21). IEEE; 2021; pp. 213–228. [CrossRef]





| Effect ∖ Root Cause | Missing/Weak Guard | Non-Atomic Op. | Incorrect Ordering | Misuse of Primitives |
|---|---|---|---|---|
| Deadlock | 6 (If-Not-While), 8 (Lost Signal), 17 (Sleeping Guard) | 11 (Race-To-Wait) | 7 (Lock Order Inversion), 16 (Signal-Then-Wait) | 2 (Broken Barrier), 5 (Flagged Deadlock) |
| Unexpected Data | 6 (If-Not-While), 9 (Partial Lock) | 12 (Racy Increment), 14 (Shared Counter) | 4 (Delayed Write) | 1 (Atomicity Bypass) |
| Concurrent Access | 3 (Broken Peterson), 15 (Shared Flag) | 12 (Racy Increment), 14 (Shared Counter) | 4 (Delayed Write) | 10 (Phantom Permit), 13 (Semaphore Leak) |

| Method | Exploration | Exploitation | Notes |
|---|---|---|---|
| Random | - | k is the minimum repetition required | |
| SA | k per step | steps | |
| GA | pop. k per generation | generations | |
| Ens | add 100 random per iteration | add 100 ranked per iteration | Full budget trains the model each iteration |

| Problem | Ens→GA | Ens→BF | Ens→SA | GA→BF | GA→SA | BF→SA |
|---|---|---|---|---|---|---|
| AtomicityBypass | 0.002 | <0.001 | <0.001 | 0.566 | <0.001 | <0.001 |
| BrokenBarrier | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| BrokenPeterson | 0.002 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| DelayedWrite | 0.003 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| FlaggedDeadlock | 0.003 | 0.002 | <0.001 | <0.001 | <0.001 | <0.001 |
| IfNotWhile | 0.003 | 0.003 | 0.001 | <0.001 | <0.001 | <0.001 |
| LockOrderInversion | 0.984 | 0.434 | 0.003 | 0.054 | <0.001 | <0.001 |
| LostSignal | <0.001 | <0.001 | <0.001 | 0.174 | <0.001 | <0.001 |
| PartialLock | 0.295 | 0.214 | 0.003 | 0.130 | <0.001 | <0.001 |
| PhantomPermit | 0.003 | 0.003 | 0.003 | 0.127 | 0.003 | 0.011 |
| RaceToWait | 0.007 | 0.003 | 0.003 | <0.001 | <0.001 | 0.996 |
| RacyIncrement | 0.003 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| SemaphoreLeak | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| SharedCounter | 0.003 | 0.003 | 0.001 | <0.001 | <0.001 | <0.001 |
| SharedFlag | 0.003 | 0.003 | <0.001 | <0.001 | <0.001 | <0.001 |
| SignalThenWait | 0.002 | <0.001 | <0.001 | <0.001 | <0.001 | 0.014 |
| SleepingGuard | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



