Some technical intuition on RLHF and Direct Preference Optimisation
Why we may need to optimize a model with preference data:
- We want our AI coding assistant to understand common programming mistakes so that it can correct them; nevertheless, when generating code, we would like to bias the model toward the high-quality code present in its training data.
- We want our language model to be aware of a common misconception believed by 50% of people, without the model itself believing the misconception. In other words, we certainly do not want the model to claim the misconception is true in 50% of queries about it!
Selecting the model’s desired responses and behaviour from its very wide knowledge and abilities is crucial to building AI systems that are safe, performant, and controllable.
The paper shows that the RL-based objective used by existing methods can be optimized exactly with a simple binary cross-entropy objective, which greatly simplifies the preference-learning pipeline. As we know, training with RL can be a PITA.
RLHF (Reinforcement Learning from Human Feedback)
RLHF relies on a theoretical preference model (such as the Bradley-Terry model) that measures how well a given reward function aligns with empirical preference data. What this means is that we need a reward function which is consistent with the human preferences.
The RLHF pipeline, as in Ziegler et al., follows three phases:
1. Supervised Finetuning
Finetune a pretrained LM with supervised learning on high-quality data for the downstream tasks of interest. This gives us a model $\pi^{SFT}$. A minimal sketch of this step follows.
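This is a minimal sketch of the SFT step, assuming a Hugging Face causal LM and a toy in-memory list of demonstration strings; the model name and data below are illustrative placeholders, not details from the paper.

```python
# Supervised finetuning (SFT) sketch: standard next-token prediction on
# high-quality demonstrations. Model name and data are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's experiments use larger models (e.g. GPT-J)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = ["PROMPT: ...\nRESPONSE: ..."]  # flattened (x, y) pairs as plain text

model.train()
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt")
    # With labels == input_ids, the model computes the usual causal-LM loss internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# The finetuned model plays the role of pi_SFT in the next phases.
```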
2. Reward Modelling Phase:
- Prompt the SFT model with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{SFT}$.
- Humans prefer one answer over the other: $y_w > y_l \mid x$. The preference is assumed to be generated by some latent reward model $r^*(x, y)$ which we don’t have access to.
- An example of such a preference model is the Bradley-Terry (BT) model.
- The human distribution $p^*$ can be written as:
$$p^*(y_1 > y_2 | x) = \frac{e^{r^*(x, y_1)}}{e^{r^*(x, y_1)} + e^{r^*(x, y_2)}}$$
- (It is presumably written as powers of $e$ to ensure both terms are positive; $r_1 / (r_1 + r_2)$ could cause issues since $r_i \in \mathbb{R}$ and may be negative.)
- Note that the BT probability can be rewritten as $p^*(y_1 > y_2 | x) = \sigma(r^*(x, y_1) - r^*(x, y_2))$, which is why a sigmoid appears below. We now try to find a parametrised $r_\phi(x, y)$ that estimates this latent reward and explains the human preferences. The loss for training it looks like (a code sketch follows after this list):
$$\mathcal{L}_R(r_\phi, D) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}[-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))]$$
- By minimising the above, we encourage the reward function to give a higher reward to the winner than to the loser.
- More often than not, $r_\phi(x, y)$ is initialized from $\pi^{SFT}$ with a linear layer on top that outputs a scalar value.
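Here is what that loss looks like in code. This is a minimal sketch, assuming a `reward_model` that maps prompt/answer token ids to a scalar (e.g. $\pi^{SFT}$ with a scalar head) and hypothetical batch field names; it is not the paper's implementation.

```python
# Bradley-Terry reward-model loss L_R: -log sigma(r_w - r_l), averaged over the batch.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, batch):
    # r_phi(x, y_w) and r_phi(x, y_l): scalar rewards for the chosen / rejected answers
    r_w = reward_model(batch["prompt_ids"], batch["chosen_ids"])    # shape (B,)
    r_l = reward_model(batch["prompt_ids"], batch["rejected_ids"])  # shape (B,)
    # Minimising this pushes the winner's reward above the loser's.
    return -F.logsigmoid(r_w - r_l).mean()
```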
3. RL Fine Tuning Phase:
All that is left is to use the reward function trained above to improve the original policy of the LLM. The objective for this phase looks like:
$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)}[r_\phi(x, y)] - \beta \mathbb{D}_{KL}[\pi_\theta(y|x) || \pi_{ref}(y|x)]$$
Because generation is discrete (tokens are sampled autoregressively), gradients cannot flow through the sampled text, so we can’t optimize the above objective directly. This is where RL helps us, with algorithms like REINFORCE and PPO!
So, contrary to how one might imagine “RL”HF taking place in some on-line feedback-loop fairy world, the process is quite applicable as an “offline” method! Quite similar to how molecule-generation LMs use some chemical property as a reward to train the LM (MoleGuLAR, MolGPT, etc.).
- Final Reward: $$\text{Reward}(x, y) = r_\phi(x, y) - \beta \left(\log \pi_\theta(y|x) - \log \pi_{ref}(y|x)\right)$$
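As a rough sketch of how this shaped reward could drive a policy update, here is a plain REINFORCE-style step. It assumes per-sequence log-probabilities and rewards have already been computed; PPO would add clipping, value baselines, and per-token treatment on top of this, so this is not the paper's setup.

```python
# REINFORCE-style update with the KL-shaped reward above (illustrative sketch).
import torch

def reinforce_step(policy_logps,  # log pi_theta(y|x) for sampled y, requires grad, shape (B,)
                   ref_logps,     # log pi_ref(y|x) for the same y, no grad,         shape (B,)
                   rewards,       # r_phi(x, y) from the learned reward model,       shape (B,)
                   beta, optimizer):
    with torch.no_grad():
        # Reward(x, y) = r_phi(x, y) - beta * (log pi_theta - log pi_ref)
        shaped = rewards - beta * (policy_logps.detach() - ref_logps)
    # Score-function (REINFORCE) estimator: raise the log-prob of high-reward samples.
    loss = -(shaped * policy_logps).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```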
Even though we have formulated the problem well, using any RL algorithm (PPO, TRPO, etc.) in practice is quite hard.
Direct Preference Optimization
Unlike prior RLHF methods, which learn a reward model and then optimize it via RL, DPO leverages a particular choice of reward-model parameterization that lets the optimal policy be extracted in closed form, without an RL training loop.
Section 4 of the paper starts with the motive: reparameterize the human preference function $p^*$ under the ideal reward $r^*$ so that it is expressed in terms of the optimal policy rather than the reward model. The key step (the full derivation is not needed to understand the gist of the paper) is that the KL-constrained objective above has a closed-form optimum $\pi^*(y|x) \propto \pi_{ref}(y|x) \exp(r(x, y)/\beta)$, which can be inverted to write the reward in terms of the policy. Plugging this into the preference model gives the policy objective:
$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$
With this, we fit an implicit reward whose optimal policy is simply $\pi_\theta$.
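A minimal PyTorch sketch of this loss, assuming the summed log-probabilities of the chosen ($y_w$) and rejected ($y_l$) responses under both $\pi_\theta$ and $\pi_{ref}$ have already been computed; the function and argument names are mine, not the paper's reference implementation.

```python
# DPO loss: binary cross-entropy on the implicit reward margin.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for the winner and the loser.
    chosen_rewards   = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigma(margin): the same binary cross-entropy form as the reward-model loss above.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```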
Trying to understand the different components of the gradient of this loss:
$\nabla_\theta \mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\beta \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \underbrace{\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))}_\text{higher weight when reward estimate is wrong} \left[ \underbrace{\nabla_\theta \log \pi(y_w|x)}_\text{increase the likelihood of $y_w$} - \underbrace{\nabla_\theta \log \pi(y_l|x)}_\text{decrease the likelihood of $y_l$} \right] \right]$
where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ is the reward implicitly defined by the language model $\pi_\theta$.
The above gradient makes sense with some hand-wavy maths (a tiny numeric illustration follows this list). We:
- We increase the likelihood of the winner,
- Decrease the likelihood of the loser,
- Weigh the gradient update according to how incorrectly the implicit reward ranks the sample $(x, y_w, y_l) \sim \mathcal{D}$.
- If the estimates are way off, we make a bigger correction.
- If they are somewhat okay, don’t change $\theta$ that much.
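As a tiny numeric illustration of that weighting term (the margins below are made-up numbers, purely for intuition):

```python
# The gradient weight sigma(r_hat_l - r_hat_w): large when the implicit reward
# prefers the loser (the model is wrong), small when it already prefers the winner.
import torch

margins = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])  # r_hat_w - r_hat_l
weights = torch.sigmoid(-margins)                     # sigma(r_hat_l - r_hat_w)
for m, w in zip(margins.tolist(), weights.tolist()):
    print(f"implicit reward margin {m:+.1f} -> gradient weight {w:.3f}")
# margin -4.0 (badly wrong) -> ~0.982; margin +4.0 (confidently right) -> ~0.018
```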
Your LM is secretly a reward model:
This section of the paper proves that the reward reparametrisation $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ does not constrain the class of learned reward models, and allows exact recovery of the optimal policy. Most of the proofs are deferred to the appendix and hence left out of these notes.
Experiments and comparisons:
- We find that DPO produces by far the most efficient frontier, achieving the highest reward while still keeping the KL low. This result is particularly notable for multiple reasons. First, DPO and PPO optimize the same objective, but DPO is notably more efficient; DPO’s reward/KL trade-off strictly dominates PPO’s. Second, DPO achieves a better frontier than PPO even when PPO can access ground-truth rewards (PPO-GT).
- We evaluate different methods by sampling completions on the test split of the TL;DR summarization dataset and computing the average win rate against reference completions in the test set. DPO, PPO and Preferred-FT all fine-tune the same GPT-J SFT model. We find that DPO has a win rate of approximately 61% at a temperature of 0.0, exceeding PPO’s 57% at its optimal sampling temperature of 0.0. DPO also achieves a higher maximum win rate than the best-of-N baseline. We note that we did not meaningfully tune DPO’s $\beta$ hyperparameter, so these results may underestimate DPO’s potential. Moreover, we find DPO to be much more robust to the sampling temperature than PPO, whose performance can degrade to that of the base GPT-J model at high temperatures.
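For reference, the win-rate computation itself is simple. Below is a hedged sketch assuming a hypothetical `generate_completion` sampler and a `judge_prefers` function standing in for the paper's GPT-4 judge; it does not reproduce the paper's exact evaluation harness.

```python
# Win rate: fraction of test prompts where the judge prefers the model's completion
# over the reference completion. Both helpers are hypothetical stand-ins.
def win_rate(test_set, generate_completion, judge_prefers, temperature=0.0):
    wins = 0
    for example in test_set:
        candidate = generate_completion(example["prompt"], temperature=temperature)
        if judge_prefers(candidate, example["reference"], example["prompt"]):
            wins += 1
    return wins / len(test_set)
```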