Preprint
Review

This version is not peer-reviewed.

A Survey on Hint-Based RLVR: Overcoming Zero-Advantage Failures with External Textual Signals

  † These authors contributed equally to this work.

Submitted: 12 June 2026

Posted: 12 June 2026

Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central paradigm for post-training large language models, yet group-relative methods often suffer from zero-advantage failures, in which identical rewards across a rollout group erase the policy-gradient signal. A growing body of work addresses this bottleneck by intervening in rollout-group construction to restore learnable contrasts. Among these efforts, methods that introduce external textual signals from beyond the model's own distribution, such as reference trajectories, abstract scaffolds, and reusable experience, have emerged as a key branch, because they can restore learnable contrasts while also expanding the model's capability boundary. This paper provides the first systematic survey of this branch: we introduce Hint as a unifying concept for such external textual signals and organize hint-based RL methods into sample-level hints, covering trajectory-based and scaffold-based guidance, and task-level hints, covering static and evolving experience bases. Beyond the taxonomy, we clarify the boundaries of the concept, analyze hint construction and utilization across levels, and outline future directions. We maintain an up-to-date resource list at https://github.com/WYRipple/Awesome-Hint-Based-RL
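To make the zero-advantage failure concrete, the following minimal sketch assumes a GRPO-style normalization in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. This particular formula, the function name `group_relative_advantages`, and the `eps` parameter are illustrative assumptions, not definitions taken from the survey. It shows that a group with identical rewards yields all-zero advantages, and that a single successful rollout (for example, one produced with the help of a hint) restores a contrast.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# All rollouts fail the verifier: every advantage is zero, so there is
# no policy-gradient signal for this prompt.
all_fail = [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(all_fail))

# One rollout succeeds (assumed here to be hint-guided): the group now
# has a positive advantage for the success and negative ones for the rest.
with_hint = [0.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(with_hint))
```

The same collapse occurs when every rollout succeeds, since the rewards are again identical within the group.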
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
