3. Materials and Methods
The proposed duplicate error detection system uses a structured approach that integrates search and classification models, ensuring efficient and accurate identification of duplicate errors. A visual representation of the algorithm is shown in
Figure 1. The system consists of the following key steps:
- 1. Pre-processing
Each new error report undergoes minimal pre-processing, during which the title and description are extracted and combined into a single text block. In previous work, these fields were used as the main signal sources for text search and similarity [18], as well as in modern two-stage candidate selection systems [32]. Preliminary experiments were conducted to find the best way to represent the text. Three inputs were evaluated (title only, description only, and title combined with description), along with additional pre-processing steps such as stop-word removal, lemmatisation, and stemming. The results showed that combining the title and description provides the most informative representation, as it covers both the general summary of the problem and its detailed description. Further text normalisation had a minimal impact on performance, as the transformer models considered in the next step handle these aspects on their own.
- 2. Generation of vector representations
The processed text blocks of errors are converted into dense vector representations using pre-trained transformer models. For the $i$-th error report with the title $t_i$ and description $d_i$, the vector is calculated using the formula:

$$v_i = f(t_i \oplus d_i),$$

where $f$ is the text vectorisation transformer model, and $\oplus$ denotes concatenation.
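The vectorisation step can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function names, the `id` field, the newline separator, and the `toy_encode` stand-in are all assumptions introduced here. The encoder is passed in as a callable, so a pre-trained transformer could be substituted without changing the surrounding code.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def embed_reports(
    reports: Sequence[Dict[str, str]],
    encode: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """Map each report id to encode(title + separator + description)."""
    vectors: Dict[str, Vector] = {}
    for report in reports:
        # Concatenate title and description into one text block.
        # The newline separator is an assumption, not taken from the paper.
        text_block = f"{report['title']}\n{report['description']}"
        vectors[report["id"]] = encode(text_block)
    return vectors


def toy_encode(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in encoder (hashed bag of words), NOT a transformer."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # Deterministic bucket index derived from the token's characters.
        vector[sum(ord(ch) for ch in token) % dim] += 1.0
    return vector


# Illustrative reports (invented for this sketch).
reports = [
    {"id": "101", "title": "Crash on startup",
     "description": "Browser crashes when the profile is missing."},
    {"id": "102", "title": "Startup crash",
     "description": "Crash at launch if no profile exists."},
]
vectors = embed_reports(reports, toy_encode)
```

In a real pipeline, `encode` would wrap one of the transformer models discussed later in this section; the toy encoder only keeps the sketch runnable without external dependencies.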
- 3. Indexing and searching in ChromaDB
The generated vector representations are stored in ChromaDB, a vector database that supports efficient approximate nearest-neighbour search for large-scale candidate selection, as demonstrated in [31]. ChromaDB provides fast search for potential duplicates by organising vector representations into a structured index optimised for similarity search. For a new bug report $q$ with vector $v_q$, the system retrieves the $K$ most similar reports based on cosine similarity:

$$\mathrm{sim}(v_q, v_j) = \frac{v_q \cdot v_j}{\lVert v_q \rVert \, \lVert v_j \rVert},$$

where $v_j$ represents the vectors stored in ChromaDB.
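The retrieval criterion can be illustrated with an exact, brute-force search over an in-memory dictionary. This sketch does not use the ChromaDB API and does not reproduce its approximate index; the function names and the toy vectors are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k stored reports most similar to the query, best first."""
    scored = [(bug_id, cosine_similarity(query, vec)) for bug_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors (invented for this sketch).
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
candidates = top_k_similar([1.0, 0.1], index, k=2)
candidate_ids = [bug_id for bug_id, _ in candidates]
```

The exhaustive scan costs one similarity computation per stored report, which is why a dedicated vector index is preferable once the collection grows large.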
- 4. Classification and refinement
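Selected candidates are further analysed using machine learning classification models to determine whether they are true duplicates. The classifier is trained on labelled pairs of bug reports, with the vector representations of the two text blocks concatenated as follows:

$$\hat{y}_{ij} = g(v_i \oplus v_j),$$

where $g$ is the classification model, $v_i$ and $v_j$ are the vectors of the two reports in the pair, and $\oplus$ denotes concatenation.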
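The pair representation can be sketched as below, with a logistic model standing in for the classifier. The weights, bias, and function names are hypothetical and chosen only to make the example runnable; a trained model would supply its own parameters.

```python
import math
from typing import List, Sequence


def pair_features(v_i: Sequence[float], v_j: Sequence[float]) -> List[float]:
    """Concatenate the two report vectors into one feature vector."""
    return list(v_i) + list(v_j)


def duplicate_probability(
    v_i: Sequence[float],
    v_j: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Score a pair with a logistic model (weights are hypothetical)."""
    features = pair_features(v_i, v_j)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated vector length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

Because the features are a plain concatenation, swapping the order of the two reports changes the input, so a classifier trained this way is not guaranteed to give the same score for both orderings.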
- 5. Final decision
If the system detects a duplicate with high confidence, the error report is linked to an existing issue. If no duplicate is found, the report is not associated with any of the existing errors. The proposed hybrid approach, combining semantic search and classification, ensures scalability and accuracy in duplicate detection, effectively handling the variability of error descriptions.
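The decision rule can be sketched as a simple threshold on the classifier's output. The text does not state what counts as high confidence, so the threshold of 0.9 below is a hypothetical value, as are the function name and the input format.

```python
from typing import List, Optional, Tuple


def decide(
    candidates: List[Tuple[str, float]],
    threshold: float = 0.9,  # hypothetical value; not specified in the text
) -> Optional[str]:
    """Return the id of the issue to link to, or None if no duplicate is found.

    `candidates` holds (bug_id, duplicate_probability) for each retrieved report.
    """
    if not candidates:
        return None
    best_id, best_probability = max(candidates, key=lambda item: item[1])
    return best_id if best_probability >= threshold else None
```

For example, `decide([("a", 0.95), ("b", 0.40)])` links the new report to issue `a`, whereas `decide([("a", 0.50)])` returns `None` and leaves the report unlinked.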
The dataset used in this study is BugHub, a large-scale collection of bug reports obtained from several open-source projects. It contains structured metadata, including titles, descriptions, timestamps, components, and references to duplicates. BugHub is a well-suited dataset for duplicate detection research because it contains labelled duplicate relationships, allowing for the evaluation of both search and classification models.
Each bug report contains the following fields selected for analysis:
Bug ID: A unique identifier for each report;
Title: A short text description of the problem;
Description: A detailed explanation of the problem, including steps to reproduce, relevant versions, and a comparison of expected and actual behaviour;
Duplicated Issue: An indication of whether this bug has been marked as a duplicate and the number of the corresponding pair.
This study focuses on five key projects in BugHub: Mozilla Core, Firefox, Thunderbird, Eclipse Platform, and JDT. These projects cover a variety of domains, from web browsers to integrated development environments, which supports the generalisability of the findings across different types of software.
Figure 2 provides an overview of the dataset, including the total number of reported issues and the number of confirmed duplicates for each project.
To prepare the dataset for experiments, we form a diverse and representative set of bug report pairs. The set covers a wide range of similarity levels, from identical duplicates to closely related but distinct reports, as well as completely dissimilar bugs. Accordingly, the following three types of relationships are combined:
duplicates within a single duplicate group: reports that are explicitly marked as relating to the same issue;
reports from different duplicate groups within a single project: pairs whose members belong to separate duplicate groups. They may be semantically similar but are not marked as duplicates of each other, which helps the classifier learn to distinguish between similar and unrelated reports;
non-duplicates: randomly selected reports from the same project that have no duplicate connections.
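The three pair types can be generated along the following lines. This is a sketch under stated assumptions: the data layout (`groups` mapping a group id to a set of bug ids), the choice of one cross-group pair per pair of groups, and the label 0 for cross-group pairs are illustrative decisions, not details taken from the study.

```python
import itertools
import random
from typing import Dict, List, Sequence, Set, Tuple

Pair = Tuple[str, str, int]  # (bug_id_a, bug_id_b, label), label 1 = duplicate


def build_pairs(
    groups: Dict[str, Set[str]],
    unlinked: Sequence[str],
    n_random: int,
    seed: int = 0,
) -> List[Pair]:
    """Build pairs of the three relationship types for one project."""
    rng = random.Random(seed)
    pairs: List[Pair] = []

    # Type 1: every pair inside one duplicate group is a duplicate.
    for group_id in sorted(groups):
        for a, b in itertools.combinations(sorted(groups[group_id]), 2):
            pairs.append((a, b, 1))

    # Type 2: one report from each of two different groups (assumed label 0).
    for g1, g2 in itertools.combinations(sorted(groups), 2):
        a = rng.choice(sorted(groups[g1]))
        b = rng.choice(sorted(groups[g2]))
        pairs.append((a, b, 0))

    # Type 3: random pairs of reports that have no duplicate links (label 0).
    # Requires at least two unlinked reports when n_random > 0.
    pool = sorted(unlinked)
    for _ in range(n_random):
        a, b = rng.sample(pool, 2)
        pairs.append((a, b, 0))
    return pairs


# Toy project with two duplicate groups and three unlinked reports.
groups = {"g1": {"1", "2", "3"}, "g2": {"4", "5"}}
pairs = build_pairs(groups, unlinked=["6", "7", "8"], n_random=2)
```

Sorting the ids before sampling and fixing the seed keeps the generated pairs reproducible across runs.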
The dataset is divided into a training set of 10,000 pairs of error reports and a test set of 2,000 pairs, each with a controlled ratio of duplicates to non-duplicates. In the training set, duplicates and non-duplicates are distributed evenly (50% each), so that the classifier is not biased towards either class during training. The test set, however, reflects the real conditions of error tracking, where duplicates occur much less frequently: only 20% of test pairs are duplicates, while 80% are not. This deliberate imbalance means that the classifier is evaluated under realistic conditions, in which unrelated reports far outnumber true duplicates.
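A split with these sizes and class ratios can be sketched as follows. The function name and its parameters are hypothetical, and keeping the training and test pairs disjoint is an assumption of this sketch rather than a detail stated in the text.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str, int]  # (bug_id_a, bug_id_b, label), label 1 = duplicate


def make_splits(
    duplicates: Sequence[Pair],
    non_duplicates: Sequence[Pair],
    train_size: int = 10_000,
    test_size: int = 2_000,
    train_dup_ratio: float = 0.5,
    test_dup_ratio: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair]]:
    """Draw disjoint training and test sets with the requested class ratios."""
    rng = random.Random(seed)
    dups, nons = list(duplicates), list(non_duplicates)
    rng.shuffle(dups)
    rng.shuffle(nons)

    n_train_dup = round(train_size * train_dup_ratio)
    n_test_dup = round(test_size * test_dup_ratio)
    n_train_non = train_size - n_train_dup
    n_test_non = test_size - n_test_dup
    if n_train_dup + n_test_dup > len(dups) or n_train_non + n_test_non > len(nons):
        raise ValueError("not enough pairs for the requested split sizes")

    # Consecutive, non-overlapping slices keep the two sets disjoint.
    train = dups[:n_train_dup] + nons[:n_train_non]
    test = (dups[n_train_dup:n_train_dup + n_test_dup]
            + nons[n_train_non:n_train_non + n_test_non])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

With the default arguments this yields 5,000 duplicate and 5,000 non-duplicate training pairs, and 400 duplicate and 1,600 non-duplicate test pairs.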
Since the task of detecting duplicates involves finding reports that are similar in content, special attention was paid to transformer models that allow for the construction of contextually rich vector representations of text. BERT [22], MiniLM [24], and MPNet [25] were selected because they offer different trade-offs between speed and accuracy. BERT is a powerful base model that produces context-dependent text representations, but it is resource-intensive and not optimised for semantic search. MiniLM, thanks to its compactness, speeds up computations without significant loss of quality. MPNet demonstrates improved ability to model text semantics, which is especially important for recognising rephrased error descriptions.
Several traditional machine learning algorithms were considered for classifying potential duplicates: logistic regression, support vector machines (SVM), and XGBoost. Logistic regression estimates the probability that two reports describe the same problem, making it a simple and interpretable baseline model. SVM was tested due to its ability to work in high-dimensional spaces and, through kernel functions, to capture nonlinear relationships. XGBoost was used for its speed and adaptability to complex data distributions: as a gradient boosting method, it builds trees sequentially, with each new tree correcting the errors of the preceding ones.
The combination of transformer models for obtaining vector representations and classical machine learning algorithms for classification made it possible to develop an effective system for searching for duplicate error reports. This approach takes into account both the accuracy of detecting similar records and the speed of processing, which is critical for integration into real workflows.
Several key metrics are used to evaluate search and classification performance: Recall@k, Accuracy, Precision, Recall, F1-measure, and the Receiver Operating Characteristic (ROC) curve.
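Recall@k, or search completeness, is the proportion of relevant items that appear among the first k retrieved results, which is especially important when only the top-ranked results are considered. It is calculated using the formula:

$$\text{Recall@}k = \frac{\left|\{\text{relevant items}\} \cap \{\text{top-}k \text{ results}\}\right|}{\left|\{\text{relevant items}\}\right|}$$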
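A direct computation of this metric for a single query might look as follows. The function name and the convention of returning 0.0 when a query has no relevant items are assumptions of this sketch.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant items that appear in the first k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # convention chosen for this sketch
    hits = len(relevant_set & set(retrieved[:k]))
    return hits / len(relevant_set)
```

For a ranked list `["a", "b", "c", "d"]` in which `b` and `d` are the true duplicates, the value is 0.0 at k = 1, 0.5 at k = 2, and 1.0 at k = 4.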
Accuracy reflects the overall correctness of the classification model by measuring the ratio of correctly classified cases (both positive and negative) to the total number of cases. The formula is given below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where:
TP: true positive predictions (correctly classified positive cases);
TN: true negative predictions (correctly classified negative cases);
FP: false positive predictions (incorrectly classified as positive);
FN: false negative predictions (incorrectly classified as negative).
Precision, or positive predictive value, measures the accuracy of positive predictions by determining the ratio of true positive predictions to the total number of predicted positive cases, calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall, or the true positive rate, assesses the model's ability to identify all relevant cases:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-measure is the harmonic mean of Precision and Recall, providing a single metric that balances the two. It is particularly useful when classes are unevenly distributed in a dataset. It is calculated as:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
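The four classification metrics can be computed from confusion-matrix counts as sketched below. The function name, the zero-division convention, and the example counts are illustrative assumptions; the counts are not results from the study.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, Precision, Recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention of this sketch).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts for a 2,000-pair test set with 400 duplicates.
metrics = classification_metrics(tp=300, tn=1400, fp=200, fn=100)
```

With these invented counts, Accuracy is 0.85 while Precision is only 0.60 and Recall 0.75, which illustrates why Accuracy alone can be misleading on an imbalanced test set.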
However, to fully evaluate the model's performance, it is important to consider not only the balance between Precision and Recall, but also the relationship between sensitivity and specificity at different classification thresholds, which is best demonstrated on the ROC curve.
Sensitivity is another name for the Recall metric, $TP / (TP + FN)$. Specificity measures how well the model recognises unique (non-duplicate) reports while avoiding false positive predictions. It is calculated using the formula:

$$\text{Specificity} = \frac{TN}{TN + FP}$$
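The points of a ROC curve can be obtained by sweeping the classification threshold and recording sensitivity against 1 − specificity, as in the following sketch. The function name, the `score >= threshold` decision rule, and the toy scores are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (1 - specificity, sensitivity) for each threshold.

    A pair is predicted to be a duplicate when its score >= threshold.
    """
    points = []
    for threshold in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        points.append((1.0 - specificity, sensitivity))
    return points


# Toy scores and labels (invented for this sketch).
curve = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], thresholds=[0.0, 0.5, 1.0])
```

Lowering the threshold moves along the curve towards the upper right corner: more duplicates are caught (higher sensitivity) at the cost of more unique reports being flagged (lower specificity).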