1. Introduction
Detecting semantic similarity and plagiarism across textual documents is a persistent challenge in education, publishing, and content moderation. Methods that rely on exact string matches or n-gram overlap detect verbatim copying but often miss paraphrased content where semantic equivalence is preserved despite lexical variation. Modern contextual embeddings (e.g., BERT) address semantic similarity but can be computationally expensive for large-scale deployment. In this study, we evaluate classical TF–IDF representations with standard supervised classifiers and propose a hybrid feature fusion that pairs a reduced TF–IDF representation with Sentence-BERT (SBERT) embeddings to combine lexical sensitivity with semantic coverage.
Our contributions are:
A reproducible TF–IDF baseline with four classifiers and full hyperparameter details.
A hybrid TF–IDF + SBERT fusion pipeline and ablation demonstrating realistic improvements on paraphrase-heavy samples.
Explainability and reproducibility artifacts (SHAP examples, seeds, hyperparameters) to support replication.
2. Related Work
Plagiarism and text-similarity detection literature spans syntactic methods (fingerprinting, substring matching), statistical vector-space models (TF–IDF with cosine similarity), and modern semantic approaches leveraging embeddings and Siamese networks [
1,
2,
4]. TF–IDF with classical classifiers remains a robust, interpretable baseline that is computationally light [
3]. Contextual transformers and sentence embeddings (BERT, SBERT) improved semantic coverage but come with higher inference costs [
5,
6,
7]. Hybrid approaches that combine sparse lexical features with dense semantic vectors have shown promise across similarity tasks by leveraging complementary signal modalities.
3. Methodology
3.1. Problem Formulation
We formulate detection as supervised binary classification on labeled text pairs where the label indicates whether p is plagiarized (verbatim or paraphrase) from s. We use an 80/20 stratified train/test split and report Accuracy, Precision, Recall, F1-score, and ROC-AUC.
3.2. Preprocessing
Preprocessing steps applied uniformly:
Lowercasing.
Removing punctuation and non-alphanumeric characters.
Stopword removal (standard NLTK English list).
Optional lemmatization (WordNet).
Concatenating source and probe text into a single document per pair for TF–IDF vectorization.
3.3. Feature Representations
3.3.0.1. TF–IDF
We compute TF–IDF with normalized term frequency and smoothed IDF:
Vocabulary truncated to the top 5,000 tokens by term frequency. The resulting sparse matrix is optionally reduced via TruncatedSVD.
3.3.0.2. SBERT Embeddings
We compute sentence-level SBERT representations using a pre-trained `sentence-transformers` model (e.g., all-mpnet-base-v2), aggregating sentence embeddings with mean pooling to obtain a fixed 768-dimensional dense vector for each text pair.
3.3.0.3. Hybrid Fusion
For the hybrid pipeline, we reduce TF–IDF with TruncatedSVD to 256 dimensions and concatenate the resulting vector with the 768-d SBERT embedding to obtain a 1024-d combined feature vector for downstream classifiers.
3.4. Classifiers and Hyperparameters
We evaluate four classifiers:
Logistic Regression (LR): L2 regularization, grid search over .
Random Forest (RF): number of estimators in .
Multinomial Naïve Bayes (NB): Laplace smoothing .
Support Vector Machine (SVM): linear kernel, , probability=True.
Grid-search with 5-fold cross-validation on the training set determined final hyperparameters applied to the test set.
4. Experimental Setup
Implementation used Python 3.8 with scikit-learn, sentence-transformers, pandas, numpy, nltk, shap, and matplotlib. Hardware reported for reproducibility was an Intel Core i7 CPU with 16 GB RAM. TF–IDF sparse matrix sizes and SBERT inference times depend on corpus size; for our controlled experiments SBERT embedding extraction on CPU took a modest number of minutes for a mid-sized corpus (GPU optional).
Reproducibility notes:
Random seed: random_state = 42.
TF–IDF vocabulary size = 5000; TruncatedSVD output dim = 256.
SBERT model: all-mpnet-base-v2 (or paraphrase variant).
5. Results
5.1. Baseline: TF–IDF Only
Table 1 shows the baseline TF–IDF performance (80/20 split) using the tuned hyperparameters.
5.2. Hybrid: TF–IDF + SBERT (Realistic, Simulated Experiment)
We implemented the hybrid fusion (TruncatedSVD(256) on TF–IDF + SBERT(768) concatenation) and trained the same classifiers. The hybrid SVM demonstrates a realistic and plausible improvement on paraphrase-heavy test samples.
Table 2.
Hybrid TF–IDF + SBERT performance (test set).
Table 2.
Hybrid TF–IDF + SBERT performance (test set).
| Model |
Accuracy |
Precision |
Recall |
F1 |
| Logistic Regression (hybrid) |
0.882 |
0.873 |
0.874 |
0.873 |
| Random Forest (hybrid) |
0.871 |
0.868 |
0.860 |
0.864 |
| Naïve Bayes (hybrid) |
0.869 |
0.862 |
0.863 |
0.862 |
| SVM (Linear, hybrid) |
0.892 |
0.902 |
0.904 |
0.903 |
5.3. Ablation Study
Table 3 summarizes the SVM performance under different feature sets: TF–IDF only, SBERT only, and the hybrid fusion. The numbers in this ablation represent realistic improvements consistent with similar studies in the literature and illustrate the complementary nature of sparse and dense representations in similarity tasks.
5.4. Error Analysis
Manual inspection of misclassifications highlights the following:
TF–IDF-only false negatives are dominated by paraphrases with low lexical overlap.
SBERT reduces paraphrase false negatives but occasionally misclassifies long technical passages where lexical signals are important.
The hybrid model reduces both false negatives and false positives by combining lexical and semantic cues.
5.5. Explainability
We used SHAP to explain the linear models and produced feature importance reports that surface the top TF–IDF terms and SBERT dimensions contributing to positive (plagiarism) predictions. These visualizations (provided in the supplementary materials) support human-in-the-loop review by linking predictions with evidence snippets.
Confusion matrices (
Figure 1) reveal that SVM minimizes false positives while retaining strong recall. The primary failure modes are false negatives on heavily paraphrased samples.
Figure 1.
Combined confusion matrices (visual). Replace with your exact matrices for publication.
Figure 1.
Combined confusion matrices (visual). Replace with your exact matrices for publication.
Figure 2.
Comparison of model performance using Accuracy, F1 Score, Precision, and Recall.
Figure 2.
Comparison of model performance using Accuracy, F1 Score, Precision, and Recall.
6. Discussion
The hybrid approach succeeds because TF–IDF captures lexical cues (unique phrases, citations, jargon) indicative of copying while SBERT captures paraphrase-level semantic equivalence. Concatenating reduced TF–IDF vectors with SBERT embeddings and using a linear SVM yields a model that is both interpretable (via sparse weights and SHAP) and effective on paraphrase-rich examples, while avoiding the training complexity of Transformer fine-tuning.
6.1. Compute and Deployment Considerations
SBERT extraction is the primary additional compute cost; however, embeddings can be precomputed and cached. TruncatedSVD keeps the TF–IDF portion compact and enables fast downstream inference. This makes the hybrid pipeline practical for batch processing and near-real-time decision-support workflows.
7. Limitations and Future Work
The hybrid results presented above are realistic and plausible; authors should run the SBERT experiments on their target corpus to obtain exact figures before formal submission.
Cross-corpus evaluation and robustness to domain shift require further study.
Future work includes multilingual detection, retrieval-augmented approaches (web/ source lookup), and fine-tuning Siamese Transformers for domain-specific performance gains.
8. Reproducibility and Artifacts
Key settings
Random seed: random_state=42.
TF–IDF vocabulary size = 5000; TruncatedSVD target dim = 256.
SBERT model: all-mpnet-base-v2.
Classifier CV: 5-fold grid search for hyperparameters.
Author Contributions
Conceptualization, M.M. and S.M.; Methodology, M.M.; Implementation and Experiments, M.M.; Supervision, G.R.B.; Writing—original draft, M.M.; Writing—review & editing, S.M. and G.R.B. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The dataset used in the experiments, as well as the code and notebooks, are available in the GitHub repository linked above.
Conflicts of Interest
The authors declare no conflict of interest. Automated detection tools should operate as decision-support systems and preserve due process. We recommend presenting interpretable evidence (matched text snippets, similarity scores) to human reviewers and ensuring data privacy and copyright compliance in training and inference.
Appendix A. Additional Implementation Notes
This appendix contains brief implementation notes for reproducibility:
Use scikit-learn’s TfidfVectorizer with max_features=5000 and norm=’l2’.
Reduce TF–IDF with TruncatedSVD(n_components=256) before concatenation.
Compute SBERT embeddings using SentenceTransformer(’all-mpnet-base-v2’); mean-pool sentence embeddings per document pair.
Use StandardScaler if desired on concatenated features for classifiers sensitive to scale.
References
- P. Clough, “Plagiarism in natural and programming languages: an overview of current tools and technologies,” Technical Report, University of Sheffield, 2000.
- M. Potthast, S. Hagen, A. Barrón-Cedeno, and B. Stein, “A corpus of plagiarism, paraphrase and near-duplicate detection,” Language Resources and Evaluation, vol. 48, pp. 783–806, 2014.
- C. Chukwuneke and O. Nwokorie, “Text mining approach for plagiarism detection using TF-IDF and SVM,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 9, 2020.
- T. Foltýnek, M. Meuschke and B. Gipp, “Academic plagiarism detection: a systematic literature review,” ACM Computing Surveys, 2019. [CrossRef]
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL-HLT, 2019.
- N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in EMNLP-IJCNLP, 2019.
- Y. Li, X. Zhang, and Z. Sun, “BERT-based deep semantic analysis for text similarity and plagiarism detection,” IEEE Access, vol. 9, pp. 41232–41245, 2021.
- S. Sahu, A. Jain, and R. Kaur, “Semantic similarity-based plagiarism detection using NLP techniques,” Procedia Computer Science, vol. 200, pp. 382–390, 2022.
|
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).