Inference-Time Control for Trustworthy Large Language Models

Yuyang Bai; Zheyuan Liu; Han Yan; Zhangchen Xu; Yixin Wan; Canyu Chen; Zehong Wang; Xiangchi Yuan; Yue Huang; Guangyao Dou; Yuji Zhang; Hangxiao Zhu; Zhuofeng Li; Manling Li; Xiangliang Zhang; Mohit Bansal; Sanmi Koyejo; Kai-Wei Chang; Yu Zhang; Meng Jiang

doi:10.20944/preprints202605.1041.v1

Submitted: 14 May 2026

Posted: 15 May 2026


Abstract

Once a large language model is released, training-time alignment is hard to revise; yet deployment introduces context-specific risks that the original training cannot anticipate: evolving safety policies, jurisdictional constraints, retrieval contamination, and adaptive adversarial prompting. In this paper, we unify inference-time techniques for trustworthy generation across safety, privacy, fairness, and factuality under a single framework: the inference-time control plane, with three tiers of intervention -- External Controls (Context Engineering, Guardrails, Decoding Strategies), which act around the model; Internal Manipulations (Representation Engineering, Unlearning, Pruning), which act inside it; and System-Level Orchestration (Multi-Agent Systems), which coordinates several models. We also introduce a meta-axis evaluation framework that crosses the four trustworthiness dimensions with five evaluation axes (effectiveness, locality, generality, interpretability, efficiency), and describe representative metrics at each intersection. We identify four cross-cutting open problems: brittleness under adaptive adversaries, the control-utility tradeoff, verification of removal, and the composition of layered interventions. A curated paper list is available at https://github.com/leopoldwhite/Awesome-Inference-Time-Trustworthiness.
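The taxonomy and the evaluation grid summarized in the abstract can be written down as plain data structures. The following is a minimal sketch, not the authors' implementation: the tier, technique, dimension, and axis names come from the abstract, while the variable names (`CONTROL_PLANE`, `DIMENSIONS`, `AXES`, `EVALUATION_GRID`) and the dictionary layout are assumptions made for illustration. The metric lists are left empty because the abstract does not name specific metrics.

```python
from itertools import product

# Three tiers of the inference-time control plane, as listed in the abstract.
# The dictionary layout itself is an illustrative assumption.
CONTROL_PLANE = {
    "External Controls": [
        "Context Engineering",
        "Guardrails",
        "Decoding Strategies",
    ],
    "Internal Manipulations": [
        "Representation Engineering",
        "Unlearning",
        "Pruning",
    ],
    "System-Level Orchestration": [
        "Multi-Agent Systems",
    ],
}

# Four trustworthiness dimensions and five evaluation axes from the abstract.
DIMENSIONS = ["safety", "privacy", "fairness", "factuality"]
AXES = ["effectiveness", "locality", "generality", "interpretability", "efficiency"]

# One cell per (dimension, axis) intersection. Each cell would hold the
# representative metrics for that intersection; they are left empty here
# because the abstract does not enumerate them.
EVALUATION_GRID = {cell: [] for cell in product(DIMENSIONS, AXES)}

if __name__ == "__main__":
    for tier, techniques in CONTROL_PLANE.items():
        print(f"{tier}: {', '.join(techniques)}")
    print(f"evaluation cells: {len(EVALUATION_GRID)}")
```

Crossing four dimensions with five axes yields 20 cells, so a lookup such as `EVALUATION_GRID[("privacy", "locality")]` addresses a single intersection of the grid.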

Keywords: large language models; trustworthy AI; inference-time control; safety; privacy; fairness; factuality; guardrails; decoding strategies; representation engineering; machine unlearning; multi-agent systems

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
