I. Introduction
Text classification, as one of the core tasks in natural language processing, serves as the foundation for many applications such as sentiment analysis, public opinion monitoring, spam detection, and intelligent question answering. However, with the increasing complexity of models and the wide use of deep learning techniques, text classification models have shown significant vulnerability when facing adversarial perturbations [1]. Minor character replacements, lexical disturbances, or syntactic variations can cause large deviations in model outputs, threatening the reliability and security of real-world applications. In high-risk scenarios such as financial risk control, public opinion governance, and security review, if text classification models are attacked or misled, serious consequences may arise. Therefore, maintaining model robustness and discriminative ability under adversarial perturbations has become a key scientific challenge in natural language understanding [2].
In recent years, the rapid development of large language models (LLMs) has provided new perspectives for addressing this problem. With powerful contextual modeling and cross-task generalization capabilities, LLMs have demonstrated exceptional performance in semantic understanding, reasoning, and generation. They can capture deep semantic structures and implicit associations in language, laying a foundation for building robust text classification systems. However, although LLMs achieve outstanding performance on standard datasets, their behavior under adversarial samples remains unsatisfactory. Due to the discrete nature of language and the diversity of semantics, adversarial perturbations often involve slight textual modifications that preserve meaning but deceive models easily, leading to inconsistent predictions. Leveraging the semantic alignment and adaptive capabilities of LLMs to systematically calibrate classification models has thus become a crucial direction for improving robustness [3, 4].
The introduction of calibration mechanisms provides a promising avenue for enhancing model robustness. Traditional text classification models often fail to calibrate confidence or maintain prediction consistency, which causes overconfident errors when encountering out-of-distribution or adversarial samples [5]. By incorporating LLM-based calibration, model outputs can be constrained semantically, enabling the model to focus not only on surface-level features but also on potential semantic consistency and contextual coherence. This semantic-space alignment and calibration strategy can effectively reduce the impact of adversarial disturbances and plays an important role in uncertainty estimation and trustworthy prediction. Therefore, building a robust text classification framework based on LLM calibration is not only theoretically innovative but also practically valuable for high-reliability text understanding tasks.
From an application perspective, research on adversarially robust text classification is crucial for ensuring the safety and trustworthiness of intelligent systems. In fields such as financial risk monitoring, medical text analysis, and social media content regulation, model vulnerability can be exploited by attackers through semantic disguise or textual manipulation to mislead judgments [6]. For instance, simple word substitutions or syntactic adjustments may cause misclassification of sensitive information, leading to regulatory loopholes or misinformation. LLM-based calibration mechanisms can reconstruct language representations in high-dimensional semantic spaces, making classification decisions less sensitive to perturbations and ensuring stable model performance in open environments. This approach enhances resistance to adversarial attacks and supports the development of adaptive defense and semantic correction capabilities in intelligent systems [7].
Overall, achieving robust text classification under adversarial perturbations through LLM-based calibration represents an important shift in natural language understanding from performance optimization toward safety and trustworthiness. This direction integrates the generation and comprehension capabilities of LLMs with robustness modeling to explore mechanisms that maintain semantic consistency and decision stability in uncertain environments. The study contributes to improving interpretability and security in artificial intelligence systems and promotes the advancement of trustworthy AI. With the continuous progress of multi-task learning, knowledge distillation, and adaptive alignment, LLM-based robust text classification will become a key foundation for safe AI deployment in critical domains.
II. Method
The proposed method enhances the robustness of text classification models under adversarial perturbations by incorporating a large language model-based calibration mechanism. The architecture comprises three main components: semantic representation extraction, calibration consistency modeling, and adversarial robustness optimization. Semantic representation is achieved by applying multi-granular indexing and confidence constraint techniques [8], which strengthen both the contextual understanding and reliability of the generated features. For calibration consistency, the framework adopts structure-aware decoding methods [9] to ensure semantic alignment between clean and adversarial examples, promoting stable classification outcomes even when the input is perturbed. To enhance adversarial robustness, the model utilizes federated fine-tuning with cross-domain semantic alignment [10], allowing for adaptive adjustment of model parameters in response to both data drift and privacy constraints. Through the application of these strategies, the model achieves improved robustness, calibration accuracy, and overall reliability in text classification tasks.

First, given an input text sequence $x = \{x_1, x_2, \dots, x_n\}$, the model uses a pre-trained language model to extract context-dependent semantic embeddings. Assuming the encoder of the language model is $f_{\theta}(\cdot)$, the semantic representation can be defined as:

$$H = f_{\theta}(x) = [h_1, h_2, \dots, h_n],$$

where $h_i$ represents the semantic vector of the $i$-th word, and $H$ is the embedding matrix of the entire sentence. Through average pooling or attention-weighted mechanisms, a global sentence vector representation can be further obtained:

$$v = \sum_{i=1}^{n} \alpha_i h_i, \qquad \alpha_i = \frac{\exp(h_i^{\top} q)}{\sum_{j=1}^{n} \exp(h_j^{\top} q)},$$

where $\alpha_i$ is the attention weight, and $q$ is the learnable context query vector. The model architecture is shown in Figure 1.
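To make the pooling step concrete, the following is a minimal PyTorch sketch of the attention-weighted pooling described above. The module name, tensor shapes, and the use of a padding mask are illustrative assumptions rather than part of the original formulation.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention-weighted pooling over token embeddings H = [h_1, ..., h_n]."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # learnable context query vector q
        self.query = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        scores = token_embeddings @ self.query                   # h_i^T q -> (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # ignore padding positions
        alpha = torch.softmax(scores, dim=-1)                    # attention weights alpha_i
        v = torch.einsum("bs,bsd->bd", alpha, token_embeddings)  # global sentence vector v
        return v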
After obtaining the semantic representation, a calibration layer is introduced into the large language model to perform semantic alignment and confidence constraints on the prediction distribution. For classification tasks, the prediction probability distribution is defined as:

$$p(y \mid x) = \mathrm{softmax}(W v + b),$$

where $W$ and $b$ are the classification layer parameters. To reduce the overconfidence problem of the model under adversarial perturbations, a temperature calibration mechanism is introduced, which adjusts the predicted probability through a learnable parameter $T$:

$$\tilde{p}(y \mid x) = \mathrm{softmax}\!\left(\frac{W v + b}{T}\right).$$

When $T > 1$, the model output tends to be smooth, thus maintaining controllable confidence under adversarial sample interference. This mechanism effectively suppresses the model's sensitive response to small perturbations, making the predicted distribution more consistent with real semantic uncertainty.
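A minimal sketch of the temperature-calibrated classification head follows, using the notation above. The log-scale parameterisation that keeps $T$ strictly positive is an implementation choice assumed here, not prescribed by the method.

import torch
import torch.nn as nn

class CalibratedClassifier(nn.Module):
    """Classification head (W, b) with a learnable temperature for confidence calibration."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)  # parameters W and b
        # parameterise the temperature on a log scale so that T stays strictly positive
        self.log_temperature = nn.Parameter(torch.zeros(1))

    def forward(self, sentence_vector):
        logits = self.classifier(sentence_vector)            # W v + b
        temperature = self.log_temperature.exp()             # T > 0; T > 1 smooths the output
        return torch.softmax(logits / temperature, dim=-1)   # calibrated probability distribution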
To further enhance model robustness, the proposed method introduces an adversarial consistency constraint, ensuring that the semantic representations of original and perturbed samples remain closely aligned in embedding space. Specifically, for an adversarial input x′ and its corresponding semantic embedding v′, an adversarial consistency loss is formulated to directly penalize semantic drift caused by adversarial modifications.
To achieve stable and contextually aware semantic alignment, the model applies retrieval-augmented generation and joint modeling techniques [11]. This allows the system to dynamically retrieve and integrate relevant contextual information during both clean and adversarial input processing, significantly enhancing the resilience of semantic representations against targeted perturbations. Furthermore, contextual trust evaluation [12] is incorporated at the representation layer, providing an adaptive mechanism to assess and calibrate semantic similarity between v and v′, thereby improving the model's ability to distinguish between genuine meaning preservation and adversarial distortion.
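The cited trust evaluation technique is not detailed here; as a minimal stand-in, the similarity between v and v′ could be scored with cosine similarity and thresholded, as in the hypothetical sketch below (the function name and threshold value are assumptions).

import torch
import torch.nn.functional as F

def semantic_trust_score(v_clean, v_adv, threshold: float = 0.8):
    """Cosine-similarity check of whether a perturbed embedding preserves meaning."""
    similarity = F.cosine_similarity(v_clean, v_adv, dim=-1)  # per-example similarity in [-1, 1]
    is_trusted = similarity >= threshold                      # flag likely meaning-preserving inputs
    return similarity, is_trusted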
The adversarial consistency constraint also adopts knowledge-augmented learning strategies [13], enabling the model to maintain invariance in semantic features that are enriched by external knowledge or domain-specific context. This not only supports explainability, but also enhances the robustness of semantic alignment across varied input scenarios. Additionally, the framework utilizes multi-scale LoRA fine-tuning [14] to introduce flexibility in the alignment of semantic embeddings, allowing the model to effectively adapt its parameter space and semantic mapping in response to adversarially induced shifts.
By systematically applying these adversarial alignment techniques, the method achieves consistent, robust, and knowledge-aware semantic representations, significantly improving the model's defense against adversarial perturbations and ensuring reliable text classification performance in challenging environments. Formally, the adversarial consistency loss is defined as:

$$\mathcal{L}_{\mathrm{adv}} = \lVert v - v' \rVert_2^2.$$

This loss minimizes the distance between the original and perturbed samples in the semantic space, ensuring that the model can still capture the semantic essence under adversarial perturbations. Furthermore, to balance classification accuracy and robustness, a joint optimization objective function is introduced:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_1 \mathcal{L}_{\mathrm{adv}} + \lambda_2 \mathcal{L}_{\mathrm{cal}},$$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss, $\mathcal{L}_{\mathrm{cal}}$ is the confidence calibration loss, and $\lambda_1$ and $\lambda_2$ are the balance coefficients. This joint optimization mechanism enables the model to maintain stable predictive ability on both standard and adversarial examples.
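A sketch of the joint objective is given below, with lambda_adv and lambda_cal playing the roles of $\lambda_1$ and $\lambda_2$. Since the exact form of the confidence calibration loss is not specified in the text, a negative-entropy penalty on the calibrated distribution is used here purely as a placeholder.

import torch
import torch.nn.functional as F

def joint_objective(logits, labels, v_clean, v_adv, calibrated_probs,
                    lambda_adv: float = 0.5, lambda_cal: float = 0.1):
    """Joint loss: cross-entropy + adversarial consistency + confidence calibration term."""
    loss_ce = F.cross_entropy(logits, labels)   # classification loss on clean inputs
    loss_adv = F.mse_loss(v_adv, v_clean)       # keep clean and perturbed embeddings close
    # placeholder calibration term: penalise over-confident (low-entropy) predictions
    entropy = -(calibrated_probs.clamp_min(1e-12).log() * calibrated_probs).sum(dim=-1)
    loss_cal = -entropy.mean()
    return loss_ce + lambda_adv * loss_adv + lambda_cal * loss_cal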
Finally, to enhance the semantic discriminative consistency of the model, this paper designs a language model-based self-calibration mechanism, achieving robust feature reconstruction through adaptive feedback updates. For the model's predicted output $\tilde{p}^{(t)}$ across multiple iterations, a dynamic smoothing function is used to achieve stable updates:

$$\hat{p}^{(t)} = \beta \, \hat{p}^{(t-1)} + (1 - \beta) \, \tilde{p}^{(t)}.$$

Here, $\beta$ is the smoothing coefficient, used to balance the influence of historical predictions and calibration results. Through this mechanism, the model gradually converges to a robust confidence distribution, thereby achieving adaptive semantic calibration and classification stability against adversarial perturbations at the language level.
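The smoothing update can be realised as a simple exponential moving average over the predicted distributions, as in the sketch below; the default value of beta and the final renormalisation step are assumptions for illustration.

import torch

def smooth_predictions(prev_probs, new_probs, beta: float = 0.9):
    """Exponential smoothing of the predicted distribution across calibration iterations."""
    smoothed = beta * prev_probs + (1.0 - beta) * new_probs  # beta weights historical predictions
    return smoothed / smoothed.sum(dim=-1, keepdim=True)     # renormalise to a valid distribution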
IV. Conclusion
This study focuses on robust text classification under adversarial perturbations and proposes a stable classification framework based on large language model calibration. The method integrates semantic modeling, confidence regulation, and consistency constraints to achieve global optimization of text representations and fine-grained calibration of prediction distributions. Experimental results show that the proposed approach not only achieves high accuracy and consistency on standard classification tasks but also maintains stable performance when facing adversarial disturbances such as word substitutions, syntactic variations, and noise injection. This research redefines robustness modeling from a semantic perspective and provides new insights into improving the reliability and security of natural language processing systems.
From a mechanism perspective, this work introduces temperature calibration and adversarial consistency loss to alleviate model overconfidence under perturbed samples and enhance interpretability. Within a unified framework of adversarial learning and calibration optimization, the method enables adaptive adjustment in semantic space, allowing the model to dynamically respond to input uncertainty and improve the reliability of classification decisions. Sensitivity analyses across hyperparameters, environments, and data conditions further verify the adaptability of the calibration mechanism in different task scenarios, demonstrating its robustness and scalability in practical applications.
From an application perspective, this study has practical relevance for high-risk text processing scenarios such as intelligent question answering, public opinion analysis, financial risk control, and content moderation. Traditional text classification models are prone to errors when facing adversarial perturbations or distribution shifts. In contrast, the proposed method improves resistance to interference while maintaining high accuracy, providing a feasible path for developing reliable natural language understanding systems. In the fields of information security and decision support, this method can be extended to multimodal robust recognition, cross-lingual semantic alignment, and task transfer settings, offering a strong algorithmic foundation for the secure deployment of artificial intelligence systems.
Future research can advance in three directions. First, finer-grained semantic calibration strategies can be explored to achieve robustness optimization at the word, sentence, and document levels. Second, adaptive adversarial training and uncertainty estimation can be combined to build dynamically adjustable robust classification models with stronger defense capabilities against diverse attack patterns. Third, the proposed framework can be extended to open-domain and multi-task learning environments to enhance the adaptability and generalization ability of large language models in complex semantic contexts. Overall, this study not only provides a new theoretical paradigm for robust text classification but also lays a methodological foundation for the development of trustworthy artificial intelligence.