Submitted: 29 June 2025
Posted: 30 June 2025
Abstract
Keywords:
1. Introduction
- Alignment method: PPO, rejection sampling, DPO, RLAIF, or AI-assisted critique loops.
- Reward modeling: Human-vs-AI preference data sources and training protocols.
- Implementation details: Architecture, datasets (OASST, HH-RLHF comparisons, Databricks Dolly), training stages, and hyperparameters.
- Evaluation: MT-Bench for multi-turn dialogue, TruthfulQA for factuality, HH-RLHF metrics for helpfulness and harmlessness, plus output diversity and calibration.
- A comprehensive review of open LLM alignment techniques and their performance trade-offs.
- Detailed quantitative comparisons across benchmarks and model scales.
- Introduction and open release of SAWYER, illustrating effective multi-stage preference learning for open models.
2. Related Work
2.1. Reinforcement Learning from Human Feedback (RLHF)
2.2. Open-Source Aligned LLMs
- Mistral 7B and Mixtral 8×7B are strong base models. Mixtral is a sparse Mixture-of-Experts (MoE) model with 8 experts per layer, of which only a subset is active for each token (12.9B active parameters) [6]. Although neither model was RLHF-aligned at release, community projects such as Zephyr apply preference-based alignment on top of them.
- Falcon 7B-Instruct, released under Apache-2.0 by TII, was fine-tuned on mixed chat/instruction datasets without RLHF.
- Stanford Alpaca-7B is based on LLaMA 7B and fine-tuned using 52K instruction pairs generated via GPT-3.5. It uses no reward modeling—just plain supervised fine-tuning.
- OpenAssistant collected over 160K messages in multi-turn conversations, annotated with more than 460K quality ratings [7]. Some models use this dataset for RLHF, though no single official "OpenAssistant-7B" exists yet.
- Zephyr 7B (HuggingFace H4) was trained on synthetic preference data generated via GPT-4 and fine-tuned using DPO. The process distilled GPT-4’s behavior into a 7B model, yielding high MT-Bench scores at low cost [4].
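To make the "sparse MoE" description of Mixtral more concrete, the following is a minimal sketch of top-k expert routing for a single token. It is an illustration under stated assumptions, not Mixtral's implementation: the choice `k=2` follows Mixtral's public description rather than anything stated in this survey, and the names `top_k_route` and `moe_forward`, as well as the toy scalar experts, are hypothetical.

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Assumption: k=2 of 8 experts, as publicly described for Mixtral.
    """
    # Indices of the k largest gate logits (stable for ties).
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts, shifted for stability.
    shift = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - shift) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are skipped."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Toy example: 8 scalar "experts", where expert s multiplies its input by s.
toy_experts = [lambda x, s=s: s * x for s in range(8)]
toy_logits = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0]
routed = top_k_route(toy_logits)
output = moe_forward(2.0, toy_logits, toy_experts)
```

Because only the selected experts run for each token, the number of active parameters per token is much smaller than the total parameter count, which is the source of the inference-cost advantage noted for Mixtral.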
| Model | Base | Alignment Method | Notes |
|---|---|---|---|
| LLaMA 2-Chat | LLaMA 2 | PPO + Rejection Sampling | Two reward models |
| LLaMA 3-Chat | LLaMA 3 | PPO (assumed) | 8k context, GQA |
| Mistral | - | None | Pretrained only |
| Mixtral 8×7B | Mistral | SFT | Sparse MoE, fast inference |
| Falcon 7B-Instruct | Falcon | SFT | Apache-2.0 license |
| Alpaca 7B | LLaMA | SFT | 52K GPT-3.5-generated instruction pairs |
| OpenAssistant | Varies | PPO | Human ratings (OASST) |
| Zephyr 7B | Mistral | dDPO (distilled DPO) | GPT-4 teacher model |
3. Alignment Methods
3.1. LLaMA 2 Chat (7B/13B)
3.2. LLaMA 3
3.3. Mistral and Mixtral
3.4. Falcon 7B-Instruct
3.5. OpenAssistant Models
3.6. Alpaca 7B
3.7. Zephyr 7B
- Generating a large synthetic dataset (UltraChat) using ChatGPT.
- Scoring candidate responses with GPT-4 to build preference pairs (UltraFeedback).
- Fine-tuning Mistral 7B on these preference pairs using DPO (no reward model required).
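The DPO step in the list above can be sketched for a single preference pair as follows. This is a minimal illustration of the standard DPO objective, not the Zephyr training code: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions chosen for illustration. In practice the four inputs would be summed token log-probabilities of the chosen and rejected responses under the trainable policy and the frozen reference model.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)

# Policy identical to the reference: the loss sits at log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss drops.
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The preference signal enters only through the log-probability ratios, so no separately trained reward model is needed during optimization, although the quality of the preference pairs still bounds the result.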
3.8. RLHF Implementation Details
| Strategy | Used In | Pros / Cons |
|---|---|---|
| PPO + KL penalty | LLaMA 2-Chat, OpenAssistant | Fine-grained reward shaping; sample inefficiency |
| Rejection Sampling | LLaMA 2-Chat (early stages) | Fast initial tuning; no policy-gradient update |
| DPO | Zephyr, experimental LLaMA2 | Efficient, scalable; depends on high-quality preferences |
| No RLHF (SFT) | Alpaca, Falcon-Instruct | Simplicity; weaker alignment |
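The "PPO + KL penalty" row can be illustrated with a short sketch of how a reward-model score is commonly combined with a per-token KL penalty against the reference model before PPO updates. This follows a widely used formulation and is not taken from any specific surveyed codebase: the function name `kl_penalized_rewards` and the example values are hypothetical, and the default `kl_coef=0.02` simply reuses the coefficient reported in Section 4.1.

```python
def kl_penalized_rewards(rm_score, policy_logps, ref_logps, kl_coef=0.02):
    """Per-token rewards for one sampled response.

    Each token is penalized by kl_coef * (log pi - log pi_ref); the scalar
    reward-model score is added on the final token only.
    """
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards

# A two-token response: the policy is more confident than the reference on
# the first token (penalized) and agrees with it on the second.
example = kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0])
```

A larger `kl_coef` keeps the policy closer to the reference model at the cost of slower reward improvement, which is one reason PPO needs more tuning than DPO or plain SFT.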
4. Experimental Setup
4.1. Pipeline Stages
1. Red-teaming & AI-assisted critique: We crafted harmful prompts, obtained model responses, and applied a four-step loop: (1) generate adversarial prompts, (2) collect responses, (3) use a critique model to revise the responses, and (4) fine-tune on the revised outputs. We integrated scalable supervision by having the AI propose human-grade scores under the Constitutional AI framework.
2. Instruction fine-tuning: Using GPT-2 (124M parameters) as the base model, we added special tokens for <query>, <response>, and <pad>. We fine-tuned on 112,097 examples with AdamW, fp16 precision, gradient accumulation (8 steps), and batch size 16 for 3 epochs, observing a steep loss decline after token insertion.
3. Reward model training: We built a pairwise dataset of 76,117 training and 19,030 validation comparisons drawn from GPT-4, GPT-3.5, OPT-IML, and DaVinci outputs. We trained a bi-encoder reward model with cross-entropy loss for two epochs, obtaining training accuracy of 98.40% (epoch 1) and 98.25% (epoch 2), with validation loss rising from 0.1713 to 0.2471.
4. Reinforcement learning: We aligned the instruction-tuned policy via PPO using the Databricks Dolly dataset (15,011 examples). Key PPO hyperparameters: learning rate 1.4e-6, batch size 4, a single PPO epoch per update, and KL coefficient 0.02. Rewards were provided by the trained reward model.
5. Model deployment and evaluation: We compared three variants: base GPT-2, supervised fine-tuned, and PPO-trained. Generation strategies included nucleus sampling and contrastive decoding. We visualized reward distributions and tested across diverse prompts.
6. Summarization experiments: We fine-tuned FLAN-T5 on CNN/DailyMail with RL guidance; details are omitted for brevity.
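The reward-model stage above trains on pairwise comparisons with a cross-entropy loss. A minimal sketch of that objective follows, assuming the common Bradley-Terry form in which the model assigns one scalar score per response and the loss is the negative log-probability that the preferred response scores higher. The function names and toy scores are hypothetical, and the bi-encoder architecture itself is not modeled here.

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), computed stably."""
    diff = score_chosen - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def evaluate_pairs(score_pairs):
    """Return (mean loss, accuracy) over (chosen, rejected) score pairs.

    A pair counts as correct when the chosen response scores strictly higher.
    """
    losses = [pairwise_loss(c, r) for c, r in score_pairs]
    correct = sum(1 for c, r in score_pairs if c > r)
    n = len(score_pairs)
    return sum(losses) / n, correct / n

# Toy scores: the first pair is ranked correctly, the second is not.
mean_loss, accuracy = evaluate_pairs([(2.0, 0.0), (0.0, 1.0)])
```

Tracking both numbers per epoch on held-out pairs is what makes a pattern like the reported validation loss increase between epochs visible.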
4.2. Evaluation Metrics
- Reward score: Average reward from the learned reward model.
- Generation quality: Human vs. AI score correlation (MAE, RMSE, Pearson/Spearman).
- Comparative reward: Distribution of rewards across variants on test prompts.
- Loss curves: Training/validation loss for each stage.
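The agreement metrics listed under generation quality can be computed as in the sketch below. It uses only the Python standard library; the function names and the example score lists are hypothetical and do not reproduce the paper's data. Spearman correlation is computed as the Pearson correlation of average ranks, which also handles ties.

```python
import math

def mae(a, b):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def rmse(a, b):
    """Root mean squared error between two equal-length score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pearson(a, b):
    """Pearson correlation coefficient."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical human and AI scores for five responses.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
ai_scores = [1.5, 2.5, 2.0, 4.5, 5.0]
report = {
    "mae": mae(human_scores, ai_scores),
    "rmse": rmse(human_scores, ai_scores),
    "pearson": pearson(human_scores, ai_scores),
    "spearman": spearman(human_scores, ai_scores),
}
```

MAE and RMSE measure how far AI scores are from human scores in absolute terms, while the two correlations measure whether the AI ranks responses in the same order, so reporting both kinds separates scale bias from ranking disagreement.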
5. Results and Analysis
5.1. Instruction Fine-tuning
5.2. Reward Model
5.3. Reinforcement Learning
5.4. Model Comparison
5.5. AI vs. Human Evaluation
5.6. Discussion of Trade-offs
6. Conclusion
- DPO-based distillation emerges as a promising path for efficient alignment, yielding rapid convergence with fewer hyperparameters.
- Model strengths: Zephyr excels in small-model assistant accuracy; Mixtral offers strong capability per compute cost; LLaMA2 balances helpfulness and safety.
- Remaining gap: All open models still lag behind GPT-4 in combined helpfulness and factuality, though the margin continues to shrink.
6.1. Reproducibility
- Surveyed models: Zephyr, Vicuna, Mistral, Mixtral, LLaMA2 variants—model cards and evaluation scripts hosted on GitHub.
- SAWYER pipeline: Five Jupyter notebooks (rlaif.ipynb, sawyer_1_instruction_ft.ipynb, sawyer_2_train_reward_model.ipynb, sawyer_3_rl.ipynb, sawyer_4_use_sawyer.ipynb) include detailed preprocessing, hyperparameters, and evaluation code, built on HuggingFace Transformers and TRL.
- Data and evaluation: OASST, HH-RLHF comparison sets, Databricks Dolly, CNN/DailyMail for summarization—versioned and linked for exact replication.
6.2. Future Work
- Multimodal RLHF: Extend alignment loops to vision and audio, integrating multimodal reward models.
- RLAIF on open architectures: Generalize red-teaming with AI feedback at scale for larger open LLMs (e.g., Mistral, LLaMA2-Chat).
- Adaptive KL and calibration: Explore dynamic KL scheduling and auxiliary confidence heads to stabilize PPO and improve calibration (reduce ECE).
- Ensembling critique models: Combine diverse AI critics (GPT-4, Claude, open reward models) to mitigate positional bias and enhance safety.
- Deeper safety evaluation: Develop fine-grained benchmarks for edge-case and adversarial behaviors, and integrate constitutional constraints more tightly in RL loops.
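Since the calibration item above refers to ECE without defining it, the following sketch shows one standard way to compute expected calibration error with equal-width confidence bins. The function name, the bin count of 10, and the example inputs are assumptions for illustration, not settings from the SAWYER pipeline.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Overconfident toy model: 90% confidence but only 75% accuracy.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False])
```

A lower value means stated confidence tracks empirical accuracy more closely, which is the quantity the proposed confidence heads would aim to reduce.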
References
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288 2023.
- HuggingFace. Transformer Reinforcement Learning (TRL). HuggingFace Blog 2023.
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073 2022.
- Tunstall, L.; Beeching, E.; Lambert, N.; Rajani, N.; Rasul, K.; Belkada, Y.; Huang, S.; von Werra, L.; Fourrier, C.; Habib, N.; et al. Zephyr: Direct Distillation of LM Alignment. arXiv preprint arXiv:2310.16944 2023.
- Meta AI. Meta Llama 3 Model Card, 2024. https://huggingface.co/meta-llama/Meta-Llama-3-8B.
- Mistral AI. Mixtral of Experts: A High-Performance Sparse Mixture of Experts, 2023. https://mistral.ai/news/mixtral-of-experts/.
- Köpf, A.; Kilcher, Y.; von Rütte, D.; Anagnostidis, S.; Tam, Z.-R.; Stevens, K.; Barhoum, A.; Duc, N.M.; Stanley, O.; Nagyfi, R.; et al. OpenAssistant Conversations - Democratizing Large Language Model Alignment. arXiv preprint arXiv:2304.07327 2023.

| Stage | Train Acc. (%) | Val Loss | Key Metric |
|---|---|---|---|
| Instruction FT | – | 0.12 (final) | Loss drop 45% |
| Reward Model (epoch 1) | 98.40 | 0.1713 | – |
| Reward Model (epoch 2) | 98.25 | 0.2471 | – |
| PPO Alignment | – | – | Avg reward increase 30% |
| Deployment Eval | – | – | RL reward 2.45 |
| AI vs. Human Scoring | – | – | Pearson 0.82 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).