The Differences Between Direct Alignment Algorithms are a Blur

Episode 483 · 20:11 | 🤗 Upvotes: 84 | cs.LG

Authors:
Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov

Title:
The Differences Between Direct Alignment Algorithms are a Blur

Arxiv:
http://arxiv.org/abs/2502.01237v1

Abstract:
Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the $\beta$ parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +$3.46$ (ORPO) and +$8.27$ (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
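
For context, the $\beta$ the authors add to ORPO and ASFT plays the same role as the trade-off coefficient in two-stage pairwise methods such as DPO, which scores each completion by its log-likelihood ratio against a frozen reference policy and compares the chosen and rejected completions in a single sigmoid. As a point of reference only (this is the standard DPO objective from Rafailov et al., 2023, not the paper's modified ORPO/ASFT losses), with $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the reference, $y_w$/$y_l$ the preferred/dispreferred completions, and $\sigma$ the logistic sigmoid:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
$$

Pointwise methods score each completion on its own rather than through the difference of two implicit rewards inside the sigmoid; that pairwise-versus-pointwise distinction is the axis the abstract identifies as the main driver of performance differences.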

