Submitted:
01 August 2025
Posted:
01 August 2025
Abstract
Keywords:
1. Introduction
2. Literature Review
Foundations of Sentiment Analysis
ABSA Benchmarks
Transformer Advances for Sentiment Tasks
Parameter-Efficient Fine-Tuning
Explainability Techniques
NLP for Supply-Chain and Supplier-Risk Analytics
Research Gap and Contribution
We Contribute
| # | Reference | Domain / Task | Method | Reported Result | Limitation Addressed |
|---|---|---|---|---|---|
| 1 | (Pang & Lee, 2008) | Sentiment survey | SVM, lexicons | Baseline framing | Lacks aspect granularity |
| 2 | (Pontiki et al., 2014) | ABSA benchmark | Rule + SVM | 0.78 F1 (laptop) | Consumer domain only |
| 3 | (Devlin et al., 2019) | General NLU | BERT | +11 pp SST-2 | Costly fine-tuning |
| 4 | (He et al., 2021) | General NLU | DeBERTa | SOTA SuperGLUE | No ABSA evidence |
| 5 | (Hu et al., 2021) | Efficient FT | LoRA | 99 % fewer params | Not tested on ABSA |
| 6 | (Lundberg & Lee, 2017) | Explainability | SHAP | Model-agnostic | Few text use-cases |
| 7 | (Xu et al., 2020) | ABSA | BERT+TAPT | 0.87 F1 | No interpretability |
| 8 | (González-Carvajal & Garrido-Merchán, 2024) | Explainable ABSA | RoBERTa + SHAP | 0.82 F1 | Perf./explain trade-off |
| 9 | (Liang et al., 2017) | Supply-chain risk | Lexicon+NB | 0.68 Acc. | Low accuracy |
| 10 | (Zhang et al., 2023) | Supplier ranking | Bi-LSTM ABSA | 0.74 F1 | No global explanations |
3. Research Objectives
4. Data and Exploratory Data Analysis
Data Description and Sampling
Preprocessing
Exploratory Data Analysis (EDA)
- Rating Distribution: A strong skew toward 4–5 star reviews, evidencing class imbalance.
- Review Length: Predominantly short texts, with a small subset of longer reviews.
- Negative Review Keywords: Common terms such as “broken”, “battery”, “refund” emphasized product-quality issues; references to shipping or delivery delays were minimal.
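The three EDA checks listed above can be approximated with standard-library Python. This is a minimal sketch, not the authors' notebook: the sample rows, the `rating` and `text` keys, the "2 stars or fewer" cut-off for negative reviews, and the three-letter word filter are all illustrative assumptions.

```python
from collections import Counter
import re

# Hypothetical sample rows; real input would be the sampled review records.
reviews = [
    {"rating": 5, "text": "Works great, fast shipping"},
    {"rating": 5, "text": "Excellent cable"},
    {"rating": 4, "text": "Good value for the price"},
    {"rating": 1, "text": "Battery broken after a week, want a refund"},
    {"rating": 2, "text": "Broken on arrival"},
]

def rating_distribution(rows):
    """Share of reviews per star rating."""
    counts = Counter(r["rating"] for r in rows)
    total = sum(counts.values())
    return {star: counts[star] / total for star in sorted(counts)}

def review_lengths(rows):
    """Whitespace-token length of each review."""
    return [len(r["text"].split()) for r in rows]

def negative_keywords(rows, max_rating=2, top_k=3):
    """Most frequent words (3+ letters, a crude stop-word filter) in low-rated reviews."""
    words = Counter()
    for r in rows:
        if r["rating"] <= max_rating:
            words.update(re.findall(r"[a-z]{3,}", r["text"].lower()))
    return words.most_common(top_k)

dist = rating_distribution(reviews)
lengths = review_lengths(reviews)
top_negative = negative_keywords(reviews)
print(dist, lengths, top_negative)
```

On real data, a proper stop-word list and a tokenizer that handles punctuation and contractions would likely replace the regular expression used here.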

5. Methods
ABSA-SHAP Pipeline

| Component | Setting | Rationale |
|---|---|---|
| Base model | yangheng/deberta-v3-base-absa-v1.1 (HF hub) | Strong ABSA starting point |
| Parameter-efficient tuning | LoRA, rank = 32, α = 64, dropout = 0.05, injected into query/key/value/output projection and FFN dense layers | Keeps only ≈ 2.8 % of parameters trainable, cutting GPU memory and wall-time while preserving accuracy |
| Imbalance mitigation | Inverse-frequency class weights inside a custom CrossEntropyLoss (see WeightedLossTrainer code) | Offsets the 5 : 3 : 1 skew observed in the labelled data |
| Data split | Stratified 60 / 20 / 20 train-validation-test (907 / 303 / 303 rows) | Guarantees identical class proportions across splits |
| Batch / sequence length | Batch = 16, max_len = 256 tokens | Fits comfortably on a single A100-40 GB GPU |
| Optimiser & schedule | AdamW, LR = 5 × 10⁻⁴, linear warm-up 10 % | Empirically stable for LoRA on classification tasks |
| Epochs & early stopping | Trained for 60 epochs; best checkpoint epoch 56 selected by highest validation macro-F1 (0.934) | Prevents over-fitting while capturing late-epoch gains |
| Precision | Full FP32 | Avoids numerical instability seen with mixed-precision for small-batch LoRA |
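The ≈ 2.8 % trainable-parameter figure in the table can be sanity-checked with simple arithmetic: a rank-r adapter on a weight matrix of shape d_out × d_in adds r · (d_in + d_out) parameters. The sketch below assumes the publicly documented DeBERTa-v3-base shapes (12 layers, hidden size 768, FFN size 3072, roughly 184M parameters), adapters in every layer, and it ignores the classification head; these are assumptions for illustration, not details confirmed by the table.

```python
# Assumed DeBERTa-v3-base shapes; the total parameter count is approximate.
HIDDEN, FFN, LAYERS = 768, 3072, 12
BASE_PARAMS = 184_000_000  # assumption taken from the public model card
RANK = 32                  # LoRA rank from the settings table

def lora_params(d_in, d_out, r):
    """A rank-r adapter adds matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, RANK)  # query, key, value, attention output
    + lora_params(HIDDEN, FFN, RANK)       # FFN up-projection
    + lora_params(FFN, HIDDEN, RANK)       # FFN down-projection
)
adapter_params = LAYERS * per_layer
trainable_share = adapter_params / (BASE_PARAMS + adapter_params)

print(f"{adapter_params:,} adapter parameters, {trainable_share:.1%} of the total")
```

Under these assumptions the adapters account for about 5.3 million parameters, which is in line with the share reported in the table.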
- Model: DeBERTa-v3-base fine-tuned for aspect-based sentiment classification
- Adaptation: LoRA (r=32, α=64) applied to the top transformer layers
- Training: 160 epochs, learning rate 5e-4, batch size 16, full FP32 on A100 GPU
- Best checkpoint: Epoch 56 (global step 3192), validation accuracy = 95%, macro-F1 = 95%
- Explainability: SHAP PartitionExplainer, providing token-level attributions for each predicted sentiment together with global and local visualizations
Predictors and Outcome Measures
Evaluation Metrics
AI Assistance
6. Results
Classification Performance
| Model | Accuracy | Macro-F1 |
|---|---|---|
| DeBERTa-v3 + LoRA | 0.927 | 0.927 |
| TF-IDF + Linear SVM | 0.815 | 0.814 |
| VADER | 0.465 | 0.396 |
| TextBlob | 0.432 | 0.378 |
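For readers less familiar with the two metrics in the table, the sketch below computes accuracy and macro-F1 from scratch on a toy three-class example. The labels and predictions are invented for illustration and are unrelated to the reported results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical), three sentiment classes.
y_true = ["neg", "neg", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "neu", "neu", "pos", "pos", "neg"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels=["neg", "neu", "pos"])
print(round(acc, 3), round(f1, 3))
```

Because macro-F1 averages the classes equally, it is the more informative of the two metrics when the class distribution is skewed.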
Qualitative Inspection of Mis-Classifications

- Mixed or contrastive clauses (FN = 3, FP = 2). Examples such as “Amazingly the product wasn’t bent, however the box was abused” contain a positive opener followed by a negative qualifier. The model attends to the first clause and under-weights the adversative marker “however”.
- Soft-negative qualifiers (FN = 4). Phrases like “video quality is just ok” or “I guess two-year life is reasonable” lack overtly negative adjectives and are misread as neutral/positive.
- Lengthy service rants with embedded positives (FN = 3). Multi-paragraph complaints (e.g., the Comcast modem review) interleave neutral hardware details and scathing customer-service anecdotes; attention diffuses across the long context.
- Mild disappointment labelled neutral/positive by annotators (FP = 5). Reviews rated 3 ★ (“it’s nice but smaller than expected”) use words like “small”, “cheap” or “so/so”; the model over-reacts and predicts negative sentiment, yielding false positives for the negative class.
| Cause | Definition | FN | FP | Total |
|---|---|---|---|---|
| Mixed / contrastive clauses | Positive + “but/however” + negative | 3 | 2 | 5 |
| Soft-negative qualifiers | “just ok”, “reasonable”, understatement | 4 | 0 | 4 |
| Long rant with dispersed polarity | Multi-paragraph, topic shifts | 3 | 0 | 3 |
| Mild disappointment mis-scored | 3 ★ texts with light criticism | 0 | 5 | 5 |
| Domain jargon / ambiguity | “DOCSIS”, “USB-C” unfamiliar tokens | 1 | 1 | 2 |
| Total | | 11 | 8 | 19 |

| ID | LoRA rank (r) | Class-weighting | Epochs | Macro-F1 |
|---|---|---|---|---|
| B0 (baseline) | 32 | ✓ | 64 | 0.9307 |
| R16 | 16 | ✓ | 64 | 0.9008 |
| R8 | 8 | ✓ | 64 | 0.8909 |
| R4 | 4 | ✓ | 64 | 0.9074 |
| R64 | 64 | ✓ | 64 | 0.9105 |
| CW- | 32 | ✗ | 64 | 0.9206 |
| E5 | 32 | ✓ | 128 | 0.9043 |
| E2 | 32 | ✓ | 32 | 0.9005 |
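The ablation rows are easier to compare as differences from the baseline in percentage points. The sketch below copies the macro-F1 values from the table and computes each run's gap to B0; only the variable names are new.

```python
# Macro-F1 per ablation run, copied from the table above.
ablation = {
    "B0": 0.9307,
    "R16": 0.9008,
    "R8": 0.8909,
    "R4": 0.9074,
    "R64": 0.9105,
    "CW-": 0.9206,
    "E5": 0.9043,
    "E2": 0.9005,
}

baseline = ablation["B0"]

# Difference from the baseline in percentage points (negative = worse than B0).
deltas_pp = {
    run: round((score - baseline) * 100, 2)
    for run, score in ablation.items()
    if run != "B0"
}

best_run = max(ablation, key=ablation.get)
for run, delta in sorted(deltas_pp.items(), key=lambda item: item[1]):
    print(f"{run:>4}: {delta:+.2f} pp")
```

Every variant scores below the baseline; removing class weighting (CW-) costs about 1 pp, while the other changes cost between roughly 2 and 4 pp.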
Key Findings
Class Weighting Contributes ~1 pp
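As background for this finding, inverse-frequency weighting can be sketched in plain Python. The formula used here, w_c = N / (K · n_c), is one common convention and is an assumption, as are the class names and the 500/300/100 counts chosen to mimic a 5 : 3 : 1 skew; the weighted loss follows the weighted-mean normalisation used by PyTorch's `CrossEntropyLoss` when a `weight` tensor is supplied.

```python
import math

def inverse_frequency_weights(counts):
    """w_c = N / (K * n_c): rarer classes receive proportionally larger weights."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood, normalised by the sum of sample weights."""
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator

# Hypothetical class counts with a 5 : 3 : 1 skew.
counts = {"majority": 500, "middle": 300, "minority": 100}
weights = inverse_frequency_weights(counts)

# Two toy predictions (predicted class probabilities) and their gold labels.
probs = [
    {"majority": 0.7, "middle": 0.2, "minority": 0.1},
    {"majority": 0.5, "middle": 0.3, "minority": 0.2},
]
labels = ["majority", "minority"]

weighted = weighted_cross_entropy(probs, labels, weights)
unweighted = sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
print(weights, round(weighted, 3), round(unweighted, 3))
```

In this toy case the mistake on the minority class dominates the weighted loss, which is the mechanism by which class weighting pushes the model toward the rarer labels.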
More Epochs Do Not Always Help
Pareto of Themes

Explainability Workflow


Actionable Insights
Managerial and Financial Impact Analysis
Business Context
Savings from Early Detection
Product-Return Reduction
Operational Visibility and Negotiation Leverage
Cost of Ownership
Intangible Benefits
7. Conclusion and Discussion
Implications and Contributions
- Improved supplier evaluation accuracy using transformer-based NLP (answering RQ1)
- Identification of key negative and positive aspects—product failures, delivery timeliness, quality/performance and customer service—through SHAP (answering RQ2)
- Explainable SHAP-based insights enabling procurement teams to select reliable suppliers and target improvement programmes (answering RQ3)
- By focusing corrective action on the top four themes, we can address nearly three-quarters of negative customer sentiment—an 80/20 leverage confirmed by SHAP analysis.
- The rarity but high severity of issues like returns/exchanges suggests establishing early-warning monitors even for low-frequency complaints.
- The token-level scatter validates our model’s alignment with domain intuition, bolstering trust in the explainability framework.
Limitations and Threats to Validity
Future Work
Funding
Conflicts of Interest
Appendix A
Data Availability Statement
- Amazon Reviews 2023 (Electronics): We used a subset of the Electronics category of the Amazon Reviews 2023 dataset, which covers approximately 18.3 million users, 1.6 million items and 43.9 million ratings, with roughly 2.7 billion tokens of review text and 1.7 billion tokens of item metadata. The data were accessed through the Hugging Face datasets library (`load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Electronics")`), from which we randomly sampled 10,000 reviews for exploratory analysis and initial modeling. Only the review text and star ratings were retained; other metadata were discarded.
- Proprietary B2B Electronics Reviews: We employed a proprietary dataset comprising 1,513 customer reviews from a B2B electronics retailer. Each record includes the review text, a manually annotated sentiment label (negative, neutral, or positive), product category information (e.g., cables, peripherals), as well as associated identifiers and timestamps. Due to confidentiality agreements with the data provider, this proprietary dataset cannot be publicly shared. Interested researchers should contact the corresponding author to discuss potential access subject to a non-disclosure agreement.
References
- BayWater Packaging. (2024, May 20). Did the packaging company maintain a consistent supply chain in 2024? https://baywaterpackaging.com/did-the-packaging-company-maintain-a-consistent-supply-chain-in-2024/
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171-4186.
- González-Carvajal, L., & Garrido-Merchán, E. C. (2024). Explainable aspect-based sentiment analysis with transformers and SHAP. Expert Systems with Applications, 232.
- He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with disentangled attention [Preprint]. arXiv:2006.03654.
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, L., Wang, S., Raj, A., Liu, H., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv Preprint, arXiv:2106.09685. https://arxiv.org/abs/2106.09685
- Liang, H., Li, J., Li, Y., & Li, M. (2017). TESSA: A Twitter-enabled early warning system for supply-chain disruptions. International Journal of Production Research, 55(23), 6917-6931.
- Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In I. Guyon et al. (Eds.), Advances in Neural Information Processing Systems (Vol. 30, pp. 4765-4774). Curran Associates. https://arxiv.org/abs/1705.07874
- MetricHQ. (2025, May 9). On-time delivery (OTD). MetricHQ.
- Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135.
- Perikos, I., & Diamantopoulos, A. (2024). Explainable aspect-based sentiment analysis using transformer models. Big Data and Cognitive Computing, 8(11), 141.
- Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., & Manandhar, S. (2014). SemEval-2014 task 4: Aspect-based sentiment analysis. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014) (pp. 27-35). Association for Computational Linguistics.
- Xu, H., Liu, B., Shu, L., & Yu, P. S. (2020). BERT post-training for review reading comprehension and aspect-based sentiment analysis. Proceedings of the 2020 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2324-2335). Association for Computational Linguistics.
- Zetwerk. (2025). Supply chain reliability: Best business practices. Zetwerk: Knowledge Base. https://www.zetwerk.com/knowledgebase/supplychainreliabilitybestpractices
- Zhang, Y., Wang, X., & Chen, Z. (2023). Supplier selection and ranking using aspect-based sentiment analysis of online reviews. Computers & Industrial Engineering, 179, 108000.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).