Preprint
Article

This version is not peer-reviewed.

Inference-Time Control for Trustworthy Large Language Models

Submitted: 14 May 2026

Posted: 15 May 2026

Abstract
Once a large language model is released, training-time alignment is hard to revise; yet deployment introduces context-specific risks that the original training cannot anticipate: evolving safety policies, jurisdictional constraints, retrieval contamination, and adaptive adversarial prompting. In this paper, we unify inference-time techniques for trustworthy generation across safety, privacy, fairness, and factuality under a single framework: the inference-time control plane, with three tiers of intervention -- External Controls (Context Engineering, Guardrails, Decoding Strategies), which act around the model; Internal Manipulations (Representation Engineering, Unlearning, Pruning), which act inside it; and System-Level Orchestration (Multi-Agent Systems), which coordinates multiple models. We also introduce a meta-axis evaluation framework that crosses the four trustworthiness dimensions with five evaluation axes (effectiveness, locality, generality, interpretability, efficiency), and describe representative metrics at each intersection. We identify four cross-cutting open problems: brittleness under adaptive adversaries, the control-utility tradeoff, verification of removal, and the composition of layered interventions. A curated paper list is available at https://github.com/leopoldwhite/Awesome-Inference-Time-Trustworthiness.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
