LSE Research Online

Doubly robust alignment for large language models

Xu, Erhan, Ye, Kai, Zhou, Hongyi, Zhu, Luhan, Quinzan, Francesco and Shi, Chengchun ORCID: 0000-0001-7773-2099 (2025) Doubly robust alignment for large language models. In: 39th Conference on Neural Information Processing Systems, 2025-11-30 - 2025-12-07. (In Press)

Text (DRPO) - Accepted Version. Download (1MB)

Abstract

This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM.
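
To make the "consistent when either component is correct" property concrete, the sketch below shows a generic doubly robust off-policy value estimate in Python. It is not the paper's DRPO objective; the function and parameter names (doubly_robust_value, model_value_target, etc.) are illustrative assumptions used only to convey the general doubly robust principle the abstract invokes.

import numpy as np

def doubly_robust_value(logged_rewards, behavior_probs, target_probs,
                        model_preds_logged, model_value_target):
    # logged_rewards     : rewards observed for the logged responses
    # behavior_probs     : probability of each logged response under the behavior (reference) policy
    # target_probs       : probability of each logged response under the target policy
    # model_preds_logged : reward-model predictions for the logged responses
    # model_value_target : reward-model estimate of the target policy's expected reward
    #                      per prompt (e.g. averaged over sampled target responses)

    # Importance ratios between the target and behavior policies.
    weights = target_probs / behavior_probs

    # Direct (model-based) term plus an importance-weighted residual correction.
    # The combined estimate remains consistent if EITHER the reward model
    # OR the policy probabilities are correctly specified, not necessarily both.
    return np.mean(model_value_target + weights * (logged_rewards - model_preds_logged))

If the reward model is exact, the residual term vanishes in expectation; if instead the importance weights are exact, the correction removes any bias left by a misspecified model. This is the standard argument behind doubly robust estimators in general, which the paper adapts to preference optimization.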

Item Type: Conference or Workshop Item (Paper)
Additional Information: © 2025 The Author(s)
Divisions: Statistics
Subjects: H Social Sciences > HA Statistics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Date Deposited: 30 Oct 2025 11:24
Last Modified: 31 Oct 2025 13:27
URI: http://eprints.lse.ac.uk/id/eprint/130006
