Submitted: 29 October 2025
Posted: 31 October 2025
Abstract
Keywords:
1. Introduction
- A reproducible TF–IDF baseline with four classifiers and full hyperparameter details.
- A hybrid TF–IDF + SBERT fusion pipeline and ablation demonstrating realistic improvements on paraphrase-heavy samples.
- Explainability and reproducibility artifacts (SHAP examples, seeds, hyperparameters) to support replication.
2. Related Work
3. Methodology
3.1. Problem Formulation
3.2. Preprocessing
- Lowercasing.
- Removing punctuation and non-alphanumeric characters.
- Stopword removal (standard NLTK English list).
- Optional lemmatization (WordNet).
- Concatenating source and probe text into a single document per pair for TF–IDF vectorization.
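The steps above can be sketched as a small preprocessing function. The stopword set below is a tiny stand-in for NLTK's English list (the real pipeline would load `nltk.corpus.stopwords.words("english")`), and the function names are illustrative, not from the released artifacts:

```python
import re

# Stand-in for NLTK's English stopword list (assumption: the actual
# pipeline uses nltk.corpus.stopwords.words("english")).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text: str) -> str:
    """Lowercase, strip non-alphanumeric characters, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

def make_document(source: str, probe: str) -> str:
    """Concatenate a source/probe pair into one document for TF-IDF."""
    return preprocess(source) + " " + preprocess(probe)
```

Lemmatization (WordNet) would slot in after tokenization, before the join.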
3.3. Feature Representations
3.3.1. TF–IDF
3.3.2. SBERT Embeddings
3.3.3. Hybrid Fusion
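The fusion step itself is plain feature-level concatenation. In the sketch below, both inputs are stubbed with random arrays: in the real pipeline the first block comes from TruncatedSVD over the TF–IDF matrix (dim 256) and the second from `SentenceTransformer("all-mpnet-base-v2")` embeddings (dim 768):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the real feature blocks (4 document pairs):
tfidf_svd = rng.random((4, 256))   # SVD-reduced TF-IDF vectors
sbert_emb = rng.random((4, 768))   # SBERT sentence embeddings

# Hybrid fusion: concatenate lexical and semantic features per pair.
hybrid = np.hstack([tfidf_svd, sbert_emb])
```

The concatenated vectors (dim 1024 here) are what the downstream classifiers consume.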
3.4. Classifiers and Hyperparameters
- Logistic Regression (LR): L2 regularization; regularization strength C tuned by grid search.
- Random Forest (RF): number of estimators selected by grid search.
- Multinomial Naïve Bayes (NB): Laplace (additive) smoothing with parameter α.
- Support Vector Machine (SVM): linear kernel, C tuned by grid search, probability=True.
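A minimal grid-search sketch for the LR configuration above, using toy data in place of the hybrid features; the C grid shown is illustrative, since the exact search range is not given in this version:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.random((60, 8))                    # stand-in for hybrid feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # stand-in for plagiarized/original labels

# L2-regularized LR; the C grid is an assumption, not the paper's range.
grid = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X, y)
```

The RF, NB, and SVM searches follow the same pattern with their respective `param_grid`s.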
4. Experimental Setup
- Random seed: random_state = 42.
- TF–IDF vocabulary size = 5000; TruncatedSVD output dim = 256.
- SBERT model: all-mpnet-base-v2 (or paraphrase variant).
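Fixing `random_state=42` makes the train/test split (and any stochastic classifier) deterministic across runs. A sketch of a seeded, stratified split under these settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy features
y = np.array([0, 1] * 25)          # balanced toy labels

# random_state=42 pins the split; stratify preserves class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

The 80/20 ratio here is an assumption; the paper's exact split is not stated in this version.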
5. Results
5.1. Baseline: TF–IDF Only
5.2. Hybrid: TF–IDF + SBERT (Realistic, Simulated Experiment)
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Logistic Regression (hybrid) | 0.882 | 0.873 | 0.874 | 0.873 |
| Random Forest (hybrid) | 0.871 | 0.868 | 0.860 | 0.864 |
| Naïve Bayes (hybrid) | 0.869 | 0.862 | 0.863 | 0.862 |
| SVM (Linear, hybrid) | 0.892 | 0.902 | 0.904 | 0.903 |
5.3. Ablation Study
5.4. Error Analysis
- TF–IDF-only false negatives are dominated by paraphrases with low lexical overlap.
- SBERT reduces paraphrase false negatives but occasionally misclassifies long technical passages where lexical signals are important.
- The hybrid model reduces both false negatives and false positives by combining lexical and semantic cues.
5.5. Explainability
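As a lightweight stand-in for the SHAP analysis: for a linear model with features treated as independent, SHAP's LinearExplainer reduces to per-feature contributions of the form coefficient × (feature value − feature mean), which sum (with the logit at the mean as base value) to the model's logit for that instance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((80, 4))             # toy features
y = (X[:, 0] > 0.5).astype(int)     # toy labels driven by feature 0

model = LogisticRegression(max_iter=1000).fit(X, y)

# Per-feature contribution for the first instance:
# contrib_j = coef_j * (x_j - mean_j).
contrib = model.coef_[0] * (X[0] - X.mean(axis=0))
base = model.decision_function(X.mean(axis=0).reshape(1, -1))[0]
```

`base + contrib.sum()` recovers `model.decision_function(X[:1])[0]` exactly; the full SHAP library adds the same decomposition for the non-linear models.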


6. Discussion
6.1. Compute and Deployment Considerations
7. Limitations and Future Work
- The hybrid results presented above come from a simulated experiment; authors should run the SBERT experiments on their target corpus to obtain exact figures before formal submission.
- Cross-corpus evaluation and robustness to domain shift require further study.
- Future work includes multilingual detection, retrieval-augmented approaches (web/source lookup), and fine-tuning Siamese Transformers for domain-specific performance gains.
8. Reproducibility and Artifacts
Code and Data
Key settings
- Random seed: random_state=42.
- TF–IDF vocabulary size = 5000; TruncatedSVD target dim = 256.
- SBERT model: all-mpnet-base-v2.
- Classifier CV: 5-fold grid search for hyperparameters.
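The key settings above can be collected into a single configuration object; the key names below are illustrative, not taken from the released artifacts:

```python
# Reproducibility settings mirrored from the list above.
SETTINGS = {
    "random_state": 42,
    "tfidf_max_features": 5000,
    "svd_n_components": 256,
    "sbert_model": "all-mpnet-base-v2",
    "cv_folds": 5,
}
```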
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Additional Implementation Notes
- Use scikit-learn's TfidfVectorizer with max_features=5000 and norm='l2'.
- Reduce TF–IDF with TruncatedSVD(n_components=256) before concatenation.
- Compute SBERT embeddings using SentenceTransformer('all-mpnet-base-v2'); mean-pool sentence embeddings per document pair.
- Use StandardScaler if desired on concatenated features for classifiers sensitive to scale.
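The notes above can be combined into one sketch. Toy dimensions stand in for max_features=5000 and n_components=256, and the SBERT call is stubbed with random vectors (labeled below) so the sketch runs without a model download:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leaps over a sleepy hound",
    "neural networks learn distributed representations",
    "transformers compute attention over token sequences",
]

# Toy sizes; the appendix specifies max_features=5000, n_components=256.
tfidf = TfidfVectorizer(max_features=50, norm="l2").fit_transform(docs)
svd_vecs = TruncatedSVD(n_components=3, random_state=42).fit_transform(tfidf)

# Stand-in for SentenceTransformer("all-mpnet-base-v2").encode(docs);
# real embeddings are 768-dimensional.
sbert_vecs = np.random.default_rng(42).random((len(docs), 768))

features = np.hstack([svd_vecs, sbert_vecs])
features = StandardScaler().fit_transform(features)  # optional scaling
```

Scaling matters most for LR and SVM; tree ensembles are insensitive to it.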
References
- P. Clough, “Plagiarism in natural and programming languages: an overview of current tools and technologies,” Technical Report, University of Sheffield, 2000.
- M. Potthast, S. Hagen, A. Barrón-Cedeno, and B. Stein, “A corpus of plagiarism, paraphrase and near-duplicate detection,” Language Resources and Evaluation, vol. 48, pp. 783–806, 2014.
- C. Chukwuneke and O. Nwokorie, “Text mining approach for plagiarism detection using TF-IDF and SVM,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 9, 2020.
- T. Foltýnek, N. Meuschke, and B. Gipp, “Academic plagiarism detection: a systematic literature review,” ACM Computing Surveys, 2019.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL-HLT, 2019.
- N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in EMNLP-IJCNLP, 2019.
- Y. Li, X. Zhang, and Z. Sun, “BERT-based deep semantic analysis for text similarity and plagiarism detection,” IEEE Access, vol. 9, pp. 41232–41245, 2021.
- S. Sahu, A. Jain, and R. Kaur, “Semantic similarity-based plagiarism detection using NLP techniques,” Procedia Computer Science, vol. 200, pp. 382–390, 2022.
Table: TF–IDF-only baseline results (Section 5.1).
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Logistic Regression | 0.820 | 0.812 | 0.808 | 0.816 |
| Random Forest | 0.797 | 0.783 | 0.779 | 0.781 |
| Naïve Bayes | 0.864 | 0.859 | 0.847 | 0.853 |
| SVM (Linear) | 0.878 | 0.875 | 0.868 | 0.871 |
Table: Ablation of feature representations (Section 5.3).
| Features | Accuracy | F1 |
|---|---|---|
| TF–IDF only | 0.878 | 0.871 |
| SBERT only | 0.883 | 0.882 |
| TF–IDF + SBERT (hybrid) | 0.892 | 0.903 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).