Submitted: 10 February 2025
Posted: 10 February 2025
Abstract
Keywords:
1. Introduction
2. Literature Review
2.1. Automated Complaint Classification and Prioritization
2.2. Sentiment and Emotion Analysis in Complaints
2.3. Machine Learning and AI for Complaint Processing
2.4. Broader Implications of Consumer Complaints
2.5. Topic Modeling and Sentiment Analysis
2.6. Deep Learning and AI-Driven Monitoring
2.7. Consumer Behavior in Complaints
2.8. AI-Powered Chatbots for Complaint Resolution
3. Materials and Methods
3.1. Description of the Dataset
- Credit Reporting: 56.14%
- Debt Collection: 14.25%
- Mortgages and Loans: 11.69%
- Credit Cards: 9.58%
- Retail Banking: 8.33%
3.2. Preprocessing Steps
3.2.1. Removing Excessively Long Narratives
3.2.2. Standardizing Narrative Text
- Converting text to lowercase
- Removing special characters, excessive whitespace, and inconsistent formatting
- Normalizing text structure to ensure cleaner inputs
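The standardization steps listed above can be sketched as a single function. This is a minimal illustration, not the authors' code: the function name `standardize_narrative` and the exact set of characters kept are assumptions, since the source does not specify which special characters were removed.

```python
import re

def standardize_narrative(text: str) -> str:
    """Lowercase a narrative, drop special characters, and collapse whitespace."""
    text = text.lower()
    # Assumed rule: keep letters, digits, whitespace, and basic punctuation;
    # every other character is replaced by a space.
    text = re.sub(r"[^a-z0-9\s.,!?'$%-]", " ", text)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

example = "  I DISPUTED   the charge\ton my card ### twice!!  "
print(standardize_narrative(example))
# i disputed the charge on my card twice!!
```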
3.2.3. Removing Entries with Empty Fields
3.2.4. Creating a Balanced Subset Using Undersampling
- 200 complaints per category
- 20.00% representation per category
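Undersampling to a fixed number of complaints per category could look like the sketch below. The field names `product` and `narrative`, the helper name `undersample`, and the random seed are assumptions for illustration; the toy call uses 5 rows per class instead of the 200 reported above.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_key="product", per_class=200, seed=42):
    """Return a shuffled subset with exactly `per_class` rows per label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    subset = []
    for label, group in by_label.items():
        if len(group) < per_class:
            raise ValueError(f"only {len(group)} rows for {label!r}")
        # Sample without replacement from each category.
        subset.extend(rng.sample(group, per_class))
    rng.shuffle(subset)
    return subset

# Toy imbalanced data: 50 credit-reporting rows vs. 10 credit-card rows.
toy = ([{"product": "credit_reporting", "narrative": f"r{i}"} for i in range(50)]
       + [{"product": "credit_card", "narrative": f"c{i}"} for i in range(10)])
balanced = undersample(toy, per_class=5)
print(Counter(row["product"] for row in balanced))
```

With five categories and 200 rows each, the same call yields 1,000 complaints at 20.00% per category.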
3.3. Prompt Engineering
    conversation.append({
        'role': 'user',
        'content': 'You are an AI assistant specializing in consumer complaint classification.'
                   'Your task is to analyze a consumer complaint and classify it into the most'
                   'appropriate category from the predefined list:'
                   '["retail_banking", "credit_reporting", "credit_card", "mortgages_and_loans", "debt_collection"]'
                   'Provide your final classification in the following JSON format without explanations:'
                   '{"product": "chosen_category_name"}.\nComplaint: '
                   '...'
    })
(1)
3.4. Model Predictions
- Iterating through the dataset – Each complaint was processed row by row.
- Constructing the prompt – The complaint text was dynamically combined with the predefined complaint categories to generate a structured prompt for classification.
- Managing API communications – The appropriate API was called for each respective model, ensuring seamless interaction.
- Handling responses – The models were instructed to return their predictions in JSON format. However, DeepSeek-V3 frequently included extra text outside the expected JSON structure, requiring additional processing to extract and clean the valid classification output. A separate method was implemented to search for and capture the valid JSON response, ensuring uniformity across all model outputs.
- Storing results – All predictions were saved within the same dataset file, making it easier to analyze classification performance across different models.
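The response-handling and storage steps above can be illustrated as follows. This is a hedged sketch rather than the authors' implementation: `extract_product`, `classify_rows`, the `narrative` field, and the stub model are hypothetical names, and the real API calls are replaced by a callable passed in by the caller.

```python
import json
import re

CATEGORIES = ["retail_banking", "credit_reporting", "credit_card",
              "mortgages_and_loans", "debt_collection"]

def extract_product(raw: str):
    """Return the predicted category found in a model reply, or None."""
    # Scan every flat {...} span so that extra text around the JSON is ignored.
    for match in re.finditer(r"\{[^{}]*\}", raw):
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        product = data.get("product")
        if product in CATEGORIES:
            return product
    return None

def classify_rows(rows, ask_model, column):
    """Process complaints row by row and store each prediction in `column`."""
    for row in rows:
        reply = ask_model(row["narrative"])  # hypothetical wrapper around an API call
        row[column] = extract_product(reply)
    return rows

# Stub that mimics a reply with extra text outside the JSON object.
def fake_model(text):
    return 'Here is the answer: {"product": "debt_collection"} Hope this helps.'

rows = [{"narrative": "They keep calling me about a debt I never owed."}]
classify_rows(rows, fake_model, column="stub_prediction")
print(rows[0]["stub_prediction"])  # debt_collection
```

Returning `None` for unparseable or out-of-list answers keeps such cases visible in the stored results instead of silently mapping them to a category.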
3.5. Model Evaluation Strategy
3.5.1. Accuracy-Based Comparison
- Accuracy – The proportion of correctly classified complaints.
- Precision (weighted) – How often a model's predicted category was correct, averaged across categories and weighted by the number of true complaints in each category.
- Recall (weighted) – How well the model identified the complaints belonging to each category, averaged with the same weighting.
- F1-score (weighted) – The harmonic mean of precision and recall, balancing both metrics.
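For reference, the four metrics can be computed from scratch as in the sketch below. The function name `weighted_metrics` and the toy labels are illustrative assumptions; the weighting follows the common support-weighted definition, which may differ in edge cases from whatever library the authors used.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)       # true complaints per category
    predicted = Counter(y_pred)     # predictions per category
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        p = tp[label] / predicted[label] if predicted[label] else 0.0
        r = tp[label] / support[label]
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": sum(tp.values()) / n,
            "precision": precision, "recall": recall, "f1": f1}

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Under this definition, weighted recall is always equal to accuracy, which is consistent with the identical Accuracy and Recall columns in the results table.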
3.5.2. Confusion Matrix and Heatmap Analysis
- Common misclassifications – Whether certain categories were frequently confused with others.
- Model biases – If a model tended to favor certain categories over others.
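Both questions can be answered directly from the (true, predicted) pairs that make up a confusion matrix. The sketch below uses hypothetical helper names and toy labels; it counts the most frequent off-diagonal pairs and the share of predictions each category receives.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) pairs where the prediction was wrong."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

def prediction_share(y_pred):
    """Fraction of all predictions assigned to each category."""
    total = len(y_pred)
    return {label: count / total for label, count in Counter(y_pred).items()}

y_true = ["debt_collection"] * 3 + ["credit_card"] * 2 + ["retail_banking"]
y_pred = ["credit_reporting", "credit_reporting", "debt_collection",
          "retail_banking", "credit_card", "retail_banking"]

print(top_confusions(y_true, y_pred, k=1))
# [(('debt_collection', 'credit_reporting'), 2)]
print(prediction_share(y_pred))
```

On a balanced dataset, a category whose prediction share is well above 20% indicates that the model favors it.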
4. Results
4.1. Classification Performance
4.1.1. Overall Accuracy
4.1.2. Precision, Recall, and F1-Score
4.2. Cost Efficiency
4.3. Inference Speed and Latency
- GPT-4o-mini was the fastest model (0.86s per prediction), followed closely by GPT-4o (0.89s). Both are well suited to low-latency applications.
- Claude 3.5 Sonnet and Claude 3.5 Haiku took roughly twice as long (1.80s and 1.81s per prediction, respectively), possibly reflecting higher computational demands.
- DeepSeek-V3 (deepseek-chat) was the slowest by a wide margin (26.69s per prediction), making it unsuitable for real-time applications despite its low cost.
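To put these latencies in perspective, the sketch below converts a mean prediction time into the wall-clock time needed to process complaints one after another, and shows one way a single call could be timed. The helper names are assumptions; the latency figures are the ones reported above.

```python
import time

MEAN_SECONDS = {"gpt-4o-mini": 0.86, "gpt-4o": 0.89, "deepseek-chat": 26.69}

def sequential_hours(mean_seconds, n_complaints):
    """Hours needed to classify n complaints sequentially."""
    return mean_seconds * n_complaints / 3600

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

for model, seconds in MEAN_SECONDS.items():
    print(f"{model}: {sequential_hours(seconds, 1000):.2f} h per 1,000 complaints")
# gpt-4o-mini: 0.24 h per 1,000 complaints
# gpt-4o: 0.25 h per 1,000 complaints
# deepseek-chat: 7.41 h per 1,000 complaints
```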
4.4. Heatmaps: Model Performance Analysis
4.4.1. Claude 3.5 Sonnet
4.4.2. DeepSeek-V3
4.4.3. GPT-4o
4.4.4. GPT-4o-mini
- Credit card complaints misclassified as retail banking (27 instances).
- Credit reporting complaints confused with debt collection (17 instances) and mortgages and loans (19 instances).
- Debt collection complaints misclassified as credit reporting (51 instances).
- Retail banking complaints misclassified as credit card (29 instances).
4.4.5. Claude 3.5 Haiku
- Credit card complaints misclassified as retail banking (29 instances).
- Credit reporting complaints confused with debt collection (20 instances) and mortgages and loans (16 instances).
- Debt collection complaints misclassified as credit reporting (49 instances).
- Retail banking complaints misclassified as credit card (36 instances).
5. Discussion
5.1. Overview of Classification Performance
5.2. Misclassification Patterns and Biases
5.3. Comparative Performance of LLMs
5.4. Model Selection Based on Performance Metrics
- Claude 3.5 Sonnet is the best choice when accuracy is the highest priority, particularly for financial complaint classification, where errors can impact case handling.
- GPT-4o provides a strong balance between accuracy, cost, and inference speed, making it ideal for real-time applications.
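- DeepSeek-V3 (deepseek-chat) is the cheapest option ($0.01 per 1,000 classifications) and matches GPT-4o on accuracy (0.737 vs. 0.736), but its mean prediction time of 26.69s rules out real-time use.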
- GPT-4o-mini is well-suited for scenarios requiring fast classifications at minimal cost, although its lower accuracy may be a drawback.
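The trade-offs above can be expressed as a simple constrained choice: pick the most accurate model that fits a latency and cost budget. The sketch below uses the accuracy, cost (per 1,000 classifications), and mean prediction time reported in the results tables; the function name `pick_model` and the example budgets are hypothetical.

```python
MODELS = {
    "gpt-4o": {"accuracy": 0.736, "cost_usd": 0.45, "seconds": 0.89},
    "gpt-4o-mini": {"accuracy": 0.691, "cost_usd": 0.03, "seconds": 0.86},
    "claude-3-5-sonnet-20241022": {"accuracy": 0.763, "cost_usd": 0.64, "seconds": 1.80},
    "claude-3-5-haiku-20241022": {"accuracy": 0.721, "cost_usd": 0.17, "seconds": 1.81},
    "deepseek-chat": {"accuracy": 0.737, "cost_usd": 0.01, "seconds": 26.69},
}

def pick_model(max_seconds=float("inf"), max_cost_usd=float("inf")):
    """Most accurate model within the latency and cost limits, or None."""
    eligible = {name: m for name, m in MODELS.items()
                if m["seconds"] <= max_seconds and m["cost_usd"] <= max_cost_usd}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["accuracy"])

print(pick_model())                                     # claude-3-5-sonnet-20241022
print(pick_model(max_seconds=1.0))                      # gpt-4o
print(pick_model(max_cost_usd=0.05))                    # deepseek-chat
print(pick_model(max_seconds=1.0, max_cost_usd=0.05))   # gpt-4o-mini
```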
5.5. Addressing Common Misclassification Challenges
5.6. Recommendations for Improving Classification Accuracy
- Fine-tuning on domain-specific financial data to improve model understanding of nuanced complaint categories.
- Incorporating context-aware embeddings to mitigate misclassification in overlapping financial categories.
- Enhancing category definitions to provide clearer distinctions between similar complaint types.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

| Model | Accuracy | Precision | Recall | F1 | Cost ($) |
|---|---|---|---|---|---|
| gpt-4o | 0.736 | 0.7546 | 0.736 | 0.7358 | 0.45 |
| gpt-4o-mini | 0.691 | 0.7026 | 0.691 | 0.6933 | 0.03 |
| claude-3-5-sonnet-20241022 | 0.763 | 0.7973 | 0.763 | 0.7617 | 0.64 |
| claude-3-5-haiku-20241022 | 0.721 | 0.7287 | 0.721 | 0.7227 | 0.17 |
| deepseek-chat | 0.737 | 0.7520 | 0.737 | 0.7368 | 0.01 |

| Model | Mean Prediction Time | Total Time (for 1,000 complaints) |
|---|---|---|
| gpt-4o | 0.89s | 889.25s |
| gpt-4o-mini | 0.86s | 860.44s |
| claude-3-5-sonnet-20241022 | 1.80s | 1797.11s |
| claude-3-5-haiku-20241022 | 1.81s | 1806.23s |
| deepseek-chat | 26.69s | 26686.94s |

| Category | Best Model |
|---|---|
| Best for Accuracy | Claude 3.5 Sonnet (0.763 accuracy, highest precision & recall) |
| Best for Cost Efficiency | DeepSeek-V3 ($0.01 per 1,000 classifications) but slow |
| Best for Speed | GPT-4o-mini (0.86s per prediction) |
| Most Balanced Model | GPT-4o (good accuracy, reasonable cost, fast inference) |
| Worst in Accuracy | GPT-4o-mini (0.691 accuracy, lowest F1-score) |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).