Preprint
Article

This version is not peer-reviewed.

Adversarial Robustness in Text Classification through Semantic Calibration with Large Language Models

Submitted:

07 February 2026

Posted:

09 February 2026


Abstract
This paper addresses the vulnerability and lack of robustness of text classification models under adversarial perturbations by proposing a robust text classification method based on large language model calibration. Building on a pretrained language model, the method constructs a multi-stage framework for semantic representation and confidence regulation, achieving stable optimization of classification results through semantic embedding extraction, calibration adjustment, and consistency constraints. First, a pretrained encoder generates context-aware semantic features, and an attention aggregation mechanism produces a global semantic representation. Second, a temperature calibration mechanism smooths the output probability distribution, reducing the model's sensitivity to local perturbations. Third, adversarial consistency constraints keep original and perturbed samples aligned in semantic space, dynamically preserving semantic robustness. A joint loss function balances three optimization objectives: classification accuracy, robustness, and confidence. To verify effectiveness, sensitivity experiments on hyperparameters, environments, and data distributions are conducted. The results show that the model maintains high performance and stability under conditions such as word substitution, noise injection, and class imbalance, significantly outperforming several mainstream baseline models. This study integrates semantic-level robustness optimization with calibration learning, providing a new approach to building highly reliable text classification systems.
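The abstract gives no implementation details, so the following is only a minimal numpy sketch of the kind of three-part joint objective it describes (classification, temperature-calibrated confidence, and adversarial consistency). All function names, the cosine-based consistency term, and the weights `alpha`/`beta` are assumptions for illustration, not the authors' actual formulation:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 smooths the output distribution,
    # reducing sensitivity to local perturbations.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class.
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def consistency_loss(h, h_adv):
    # One possible consistency term (an assumption): cosine distance
    # between original and adversarially perturbed embeddings.
    num = (h * h_adv).sum(axis=-1)
    den = np.linalg.norm(h, axis=-1) * np.linalg.norm(h_adv, axis=-1) + 1e-12
    return (1.0 - num / den).mean()

def joint_loss(logits, h, h_adv, labels, T=2.0, alpha=0.5, beta=0.5):
    # Joint objective balancing accuracy, robustness, and confidence;
    # alpha and beta are illustrative trade-off weights.
    l_cls = cross_entropy(softmax(logits, T=1.0), labels)   # accuracy
    l_cal = cross_entropy(softmax(logits, T=T), labels)     # calibrated confidence
    l_con = consistency_loss(h, h_adv)                      # robustness
    return l_cls + alpha * l_con + beta * l_cal
```

A higher temperature visibly flattens the predicted distribution, and the consistency term vanishes when the perturbed embedding matches the original, which is the behavior the abstract attributes to the calibration and alignment components.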
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.


© 2026 MDPI (Basel, Switzerland) unless otherwise stated