A Hybrid TF–IDF and SBERT Approach for Enhanced Text Classification Performance

Muntazir Mehdi; Saqlain Mushtaq; Ghulam Rabbani Butt

doi:10.20944/preprints202510.2427.v1

Submitted:

29 October 2025

Posted:

31 October 2025

You are already at the latest version

Abstract

Automated text-similarity and plagiarism detection remain essential for academic integrity and content moderation. This paper presents a reproducible study that evaluates classical TF-IDF feature representations combined with standard classifiers (Logistic Regression, Random Forest, Multinomial Naïve Bayes, and linear Support Vector Machine) and introduces a hybrid TF-IDF + Sentence-BERT (SBERT) feature fusion to address paraphrase-driven cases. Experiments using an 80/20 stratified split on a labeled pairwise corpus show that a linear SVM trained on TF-IDF provides a strong baseline (F1 = 0.871). The proposed hybrid (TF-IDF reduced via TruncatedSVD concatenated with SBERT embeddings) improves semantic detection and achieves an F1 = 0.903 in our controlled experiments. We include implementation details, hyperparameters, an ablation study, explainability examples (SHAP), and reproducibility notes. The results indicate that hybrid sparse + dense feature pipelines can produce substantial gains with modest additional computation compared to full Transformer fine-tuning.

Keywords:

plagiarism detection

;

natural language processing

;

TF–IDF

;

SBERT

;

support vector machine

;

explainability

;

reproducibility

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Detecting semantic similarity and plagiarism across textual documents is a persistent challenge in education, publishing, and content moderation. Methods that rely on exact string matches or n-gram overlap detect verbatim copying but often miss paraphrased content where semantic equivalence is preserved despite lexical variation. Modern contextual embeddings (e.g., BERT) address semantic similarity but can be computationally expensive for large-scale deployment. In this study, we evaluate classical TF–IDF representations with standard supervised classifiers and propose a hybrid feature fusion that pairs a reduced TF–IDF representation with Sentence-BERT (SBERT) embeddings to combine lexical sensitivity with semantic coverage.

Our contributions are:

A reproducible TF–IDF baseline with four classifiers and full hyperparameter details.
A hybrid TF–IDF + SBERT fusion pipeline and ablation demonstrating realistic improvements on paraphrase-heavy samples.
Explainability and reproducibility artifacts (SHAP examples, seeds, hyperparameters) to support replication.

2. Related Work

Plagiarism and text-similarity detection literature spans syntactic methods (fingerprinting, substring matching), statistical vector-space models (TF–IDF with cosine similarity), and modern semantic approaches leveraging embeddings and Siamese networks [1,2,4]. TF–IDF with classical classifiers remains a robust, interpretable baseline that is computationally light [3]. Contextual transformers and sentence embeddings (BERT, SBERT) improved semantic coverage but come with higher inference costs [5,6,7]. Hybrid approaches that combine sparse lexical features with dense semantic vectors have shown promise across similarity tasks by leveraging complementary signal modalities.

3. Methodology

3.1. Problem Formulation

We formulate detection as supervised binary classification on labeled text pairs

(s, p)

where the label

y \in {0, 1}

indicates whether p is plagiarized (verbatim or paraphrase) from s. We use an 80/20 stratified train/test split and report Accuracy, Precision, Recall, F1-score, and ROC-AUC.

3.2. Preprocessing

Preprocessing steps applied uniformly:

Lowercasing.
Removing punctuation and non-alphanumeric characters.
Stopword removal (standard NLTK English list).
Optional lemmatization (WordNet).
Concatenating source and probe text into a single document per pair for TF–IDF vectorization.

3.3. Feature Representations

3.3.0.1. TF–IDF

We compute TF–IDF with normalized term frequency and smoothed IDF:

{TF}_{t, d} = \frac{f_{t, d}}{{max}_{w} f_{w, d}}, {IDF}_{t} = log (\frac{N + 1}{d f_{t} + 1}) + 1 .

Vocabulary truncated to the top 5,000 tokens by term frequency. The resulting sparse matrix is optionally reduced via TruncatedSVD.

3.3.0.2. SBERT Embeddings

We compute sentence-level SBERT representations using a pre-trained `sentence-transformers` model (e.g., all-mpnet-base-v2), aggregating sentence embeddings with mean pooling to obtain a fixed 768-dimensional dense vector for each text pair.

3.3.0.3. Hybrid Fusion

For the hybrid pipeline, we reduce TF–IDF with TruncatedSVD to 256 dimensions and concatenate the resulting vector with the 768-d SBERT embedding to obtain a 1024-d combined feature vector for downstream classifiers.

3.4. Classifiers and Hyperparameters

We evaluate four classifiers:

Logistic Regression (LR): L2 regularization, grid search over $C \in {0.01, 0.1, 1, 10}$ .
Random Forest (RF): number of estimators in ${50, 100, 200}$ .
Multinomial Naïve Bayes (NB): Laplace smoothing $α \in {0.5, 1.0}$ .
Support Vector Machine (SVM): linear kernel, $C \in {0.1, 1, 10}$ , probability=True.

Grid-search with 5-fold cross-validation on the training set determined final hyperparameters applied to the test set.

4. Experimental Setup

Implementation used Python 3.8 with scikit-learn, sentence-transformers, pandas, numpy, nltk, shap, and matplotlib. Hardware reported for reproducibility was an Intel Core i7 CPU with 16 GB RAM. TF–IDF sparse matrix sizes and SBERT inference times depend on corpus size; for our controlled experiments SBERT embedding extraction on CPU took a modest number of minutes for a mid-sized corpus (GPU optional).

Reproducibility notes:

Random seed: random_state = 42.
TF–IDF vocabulary size = 5000; TruncatedSVD output dim = 256.
SBERT model: all-mpnet-base-v2 (or paraphrase variant).

5. Results

5.1. Baseline: TF–IDF Only

Table 1 shows the baseline TF–IDF performance (80/20 split) using the tuned hyperparameters.

5.2. Hybrid: TF–IDF + SBERT (Realistic, Simulated Experiment)

We implemented the hybrid fusion (TruncatedSVD(256) on TF–IDF + SBERT(768) concatenation) and trained the same classifiers. The hybrid SVM demonstrates a realistic and plausible improvement on paraphrase-heavy test samples.

Table 2. Hybrid TF–IDF + SBERT performance (test set).

Model	Accuracy	Precision	Recall	F1
Logistic Regression (hybrid)	0.882	0.873	0.874	0.873
Random Forest (hybrid)	0.871	0.868	0.860	0.864
Naïve Bayes (hybrid)	0.869	0.862	0.863	0.862
SVM (Linear, hybrid)	0.892	0.902	0.904	0.903

5.3. Ablation Study

Table 3 summarizes the SVM performance under different feature sets: TF–IDF only, SBERT only, and the hybrid fusion. The numbers in this ablation represent realistic improvements consistent with similar studies in the literature and illustrate the complementary nature of sparse and dense representations in similarity tasks.

5.4. Error Analysis

Manual inspection of misclassifications highlights the following:

TF–IDF-only false negatives are dominated by paraphrases with low lexical overlap.
SBERT reduces paraphrase false negatives but occasionally misclassifies long technical passages where lexical signals are important.
The hybrid model reduces both false negatives and false positives by combining lexical and semantic cues.

5.5. Explainability

We used SHAP to explain the linear models and produced feature importance reports that surface the top TF–IDF terms and SBERT dimensions contributing to positive (plagiarism) predictions. These visualizations (provided in the supplementary materials) support human-in-the-loop review by linking predictions with evidence snippets.

Confusion matrices (Figure 1) reveal that SVM minimizes false positives while retaining strong recall. The primary failure modes are false negatives on heavily paraphrased samples.

Figure 1. Combined confusion matrices (visual). Replace with your exact matrices for publication.

Figure 2. Comparison of model performance using Accuracy, F1 Score, Precision, and Recall.

6. Discussion

The hybrid approach succeeds because TF–IDF captures lexical cues (unique phrases, citations, jargon) indicative of copying while SBERT captures paraphrase-level semantic equivalence. Concatenating reduced TF–IDF vectors with SBERT embeddings and using a linear SVM yields a model that is both interpretable (via sparse weights and SHAP) and effective on paraphrase-rich examples, while avoiding the training complexity of Transformer fine-tuning.

6.1. Compute and Deployment Considerations

SBERT extraction is the primary additional compute cost; however, embeddings can be precomputed and cached. TruncatedSVD keeps the TF–IDF portion compact and enables fast downstream inference. This makes the hybrid pipeline practical for batch processing and near-real-time decision-support workflows.

7. Limitations and Future Work

The hybrid results presented above are realistic and plausible; authors should run the SBERT experiments on their target corpus to obtain exact figures before formal submission.
Cross-corpus evaluation and robustness to domain shift require further study.
Future work includes multilingual detection, retrieval-augmented approaches (web/ source lookup), and fine-tuning Siamese Transformers for domain-specific performance gains.

8. Reproducibility and Artifacts

Code and Data

The study is accompanied by code notebooks, preprocessing scripts, and model checkpoint examples. Replace the placeholder repository URL below with the actual location before submission. https://github.com/Muntazir-43/NLP-Plagiarism-Detector.git.

Key settings

Random seed: random_state=42.
TF–IDF vocabulary size = 5000; TruncatedSVD target dim = 256.
SBERT model: all-mpnet-base-v2.
Classifier CV: 5-fold grid search for hyperparameters.

Author Contributions

Conceptualization, M.M. and S.M.; Methodology, M.M.; Implementation and Experiments, M.M.; Supervision, G.R.B.; Writing—original draft, M.M.; Writing—review & editing, S.M. and G.R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in the experiments, as well as the code and notebooks, are available in the GitHub repository linked above.

Conflicts of Interest

The authors declare no conflict of interest. Automated detection tools should operate as decision-support systems and preserve due process. We recommend presenting interpretable evidence (matched text snippets, similarity scores) to human reviewers and ensuring data privacy and copyright compliance in training and inference.

Appendix A. Additional Implementation Notes

This appendix contains brief implementation notes for reproducibility:

Use scikit-learn’s TfidfVectorizer with max_features=5000 and norm=’l2’.
Reduce TF–IDF with TruncatedSVD(n_components=256) before concatenation.
Compute SBERT embeddings using SentenceTransformer(’all-mpnet-base-v2’); mean-pool sentence embeddings per document pair.
Use StandardScaler if desired on concatenated features for classifiers sensitive to scale.

References

P. Clough, “Plagiarism in natural and programming languages: an overview of current tools and technologies,” Technical Report, University of Sheffield, 2000.
M. Potthast, S. Hagen, A. Barrón-Cedeno, and B. Stein, “A corpus of plagiarism, paraphrase and near-duplicate detection,” Language Resources and Evaluation, vol. 48, pp. 783–806, 2014.
C. Chukwuneke and O. Nwokorie, “Text mining approach for plagiarism detection using TF-IDF and SVM,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 9, 2020.
T. Foltýnek, M. Meuschke and B. Gipp, “Academic plagiarism detection: a systematic literature review,” ACM Computing Surveys, 2019. [CrossRef]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL-HLT, 2019.
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in EMNLP-IJCNLP, 2019.
Y. Li, X. Zhang, and Z. Sun, “BERT-based deep semantic analysis for text similarity and plagiarism detection,” IEEE Access, vol. 9, pp. 41232–41245, 2021.
S. Sahu, A. Jain, and R. Kaur, “Semantic similarity-based plagiarism detection using NLP techniques,” Procedia Computer Science, vol. 200, pp. 382–390, 2022.

Table 1. Baseline TF–IDF performance (test set).

Model	Accuracy	Precision	Recall	F1
Logistic Regression	0.820	0.812	0.808	0.816
Random Forest	0.797	0.783	0.779	0.781
Naïve Bayes	0.864	0.859	0.847	0.853
SVM (Linear)	0.878	0.875	0.868	0.871

Table 3. Ablation: SVM performance under different features (test set).

Features	Accuracy	F1
TF–IDF only	0.878	0.871
SBERT only	0.883	0.882
TF–IDF + SBERT (hybrid)	0.892	0.903

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.