Introduction: The primary objective of a Language Model (LM) is to provide accurate and helpful solutions tailored to the user’s requirements, enabling users to complete their tasks efficiently. However, in some instances, users may have intentions that lead to harmful and unlawful outcomes. These issues include models employing toxic language, responding in a confrontational or aggressive tone, or providing detailed information on dangerous subjects.
These challenges arise because large-scale models are trained on extensive datasets from the internet, where such problematic language is prevalent. Consider, for example, a scenario where you want an LM to answer an age-old riddle, but it interprets the query incorrectly and provides a tangential and nonsensical response. In such cases, the completion fails to offer a useful solution to the given task. Similarly, LLMs may provide misleading or blatantly incorrect answers. It is crucial to emphasize that LLMs must refrain from generating harmful completions, such as offensive content, discriminatory language, or guidance on engaging in criminal activities.
For instance, asking the model how to breach a neighbor’s WiFi network should result in a response that does not endorse harm or illegal actions. These essential human values—helpfulness, honesty, and harmlessness—are collectively known as the HHH principles. They serve as a guiding framework for developers in the responsible deployment of AI technology. Further refinement of LLM behaviour through fine-tuning, informed by human feedback, plays a pivotal role in aligning these models with human preferences. This iterative training process enhances the helpfulness, honesty, and harmlessness of the generated completions, thereby reducing toxicity and the generation of inaccurate information.
Reinforcement Learning: Reinforcement learning supplies the feedback mechanism for this fine-tuning process, rewarding the model for desirable outputs. Reinforcement Learning is a type of machine learning in which an agent learns to make decisions related to a specific goal by taking actions in an environment, with the objective of maximizing some notion of a cumulative reward. In this framework, the agent continually learns from its experiences by taking actions, observing the resulting changes in the environment, and receiving rewards or penalties based on the outcomes of its actions. By iterating through this process, the agent gradually refines its strategy, or policy, to make better decisions and increase its chances of success.
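The agent-environment loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in invented for illustration: a two-action "environment" that rewards action 1, and a "policy" that is just a table of action preferences.

```python
import random

def step(action):
    """Toy environment: reward 1.0 for action 1, else 0.0."""
    return 1.0 if action == 1 else 0.0

def choose_action(prefs):
    """Pick the action with the highest learned preference."""
    return max(prefs, key=prefs.get)

prefs = {0: 0.0, 1: 0.0}   # the agent's current policy (action preferences)
alpha = 0.1                 # learning rate

random.seed(0)
for _ in range(100):
    # Explore occasionally, otherwise exploit the current policy.
    action = random.choice([0, 1]) if random.random() < 0.2 else choose_action(prefs)
    reward = step(action)
    # Nudge the stored preference toward the observed reward.
    prefs[action] += alpha * (reward - prefs[action])

# After training, the agent prefers the rewarded action.
print(choose_action(prefs))
```

The loop is exactly the cycle in the text: act, observe a reward, update the policy, repeat until the policy favors high-reward actions.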
Fine-tuning large language models with RLHF: In this context, the Large Language Model (LLM) serves as the agent’s policy, guiding its actions to generate text that aligns with human preferences, including qualities like helpfulness, accuracy, and non-toxicity. The environment corresponds to the context window of the model, the space where text is entered through a prompt. The state, which informs the agent’s actions, is represented by the current context, encompassing any text within the context window.
Actions performed by the agent entail the generation of text, which may range from individual words to complete sentences, depending on the user’s specified task. The action space is defined by the token vocabulary, comprising all possible tokens the model can utilise to produce text. The decision-making process for the LLM regarding the next token in the sequence relies on its learned statistical representation of language acquired during training. At any given point, the model’s action, i.e., the token selection, is contingent on both the prompt text within the context and the probability distribution over the token vocabulary.
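Token selection from a probability distribution over the vocabulary can be sketched as follows. The four-word vocabulary and the raw scores are made up for illustration; a real model produces scores over tens of thousands of tokens.

```python
import math
import random

vocab = ["the", "cat", "sat", "mat"]
scores = [2.0, 1.0, 0.5, 0.1]   # unnormalized model outputs for the current context

def softmax(xs):
    """Convert raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(scores)

# Greedy decoding: always pick the most probable token...
greedy_token = vocab[probs.index(max(probs))]

# ...or sample from the distribution, which is what makes
# completions vary from run to run.
random.seed(0)
sampled_token = random.choices(vocab, weights=probs, k=1)[0]

print(greedy_token, sampled_token)
```

Either way, the action taken at each step depends on the current context only through the probability distribution it induces over the vocabulary, as the text describes.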
Reward assignment is a critical component, contingent upon how closely the generated text aligns with human preferences. However, assessing reward in the context of language is nuanced due to the diversity of human responses. One approach involves human evaluation of model completions against alignment metrics, such as toxicity assessment, yielding scalar feedback values of zero or one. The LLM weights are then iteratively updated to maximize the rewards derived from the human evaluator, thus enabling the model to produce non-toxic text.
However, obtaining human feedback can be resource-intensive. As a practical and scalable alternative, a reward model, distinct from the LLM, can be employed to classify LLM outputs and evaluate their alignment with human preferences. Initial training of this secondary model relies on a smaller set of human-labeled examples, following traditional supervised learning methodologies. Once trained, the reward model is utilized to assess LLM outputs, assigning reward values that, in turn, inform the adjustment of LLM weights to align more closely with human preferences.
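One common formulation of this supervised training step (an assumption here, not something the text above specifies) is a pairwise ranking loss: the reward model is pushed to score the human-preferred completion above the rejected one. The `pairwise_loss` function and the scores below are toy stand-ins for a learned model.

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """Negative log-sigmoid of the score gap: small when the chosen
    completion outscores the rejected one, large otherwise."""
    gap = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-gap)))

# When the model already ranks the human-preferred completion higher,
# the loss is small...
low = pairwise_loss(score_chosen=2.0, score_rejected=-1.0)
# ...and large when the ranking is wrong, driving weight updates.
high = pairwise_loss(score_chosen=-1.0, score_rejected=2.0)

print(low < high)
```

In a real pipeline these scores would come from a neural reward model, and the loss would be minimized by gradient descent over its weights.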
Reward Model: The reward model effectively takes the place of the human labeler, automatically choosing the preferred completion during the RLHF process. This reward model is usually itself a language model. For a given prompt X, the reward model learns to favor the human-preferred completion.
Once the model has been trained on the human-ranked prompt-completion pairs, you can use the reward model as a binary classifier to provide a set of logits across the positive and negative classes. Logits are the unnormalized model outputs before applying any activation function.
Let’s say you want to detoxify your LLM, and the reward model needs to identify whether a completion contains hate speech. In this case, the two classes would be “not hate,” the positive class that you ultimately want to optimize for, and “hate,” the negative class you want to avoid.
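Turning the classifier's logits into a scalar reward can be sketched like this. The logit values are invented for illustration; a real reward model would produce them from a (prompt, completion) pair.

```python
import math

def reward_from_logits(logits):
    """Softmax over [not-hate, hate] logits; reward = P(not-hate)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return probs[0]

# A completion the reward model considers non-toxic scores high...
good_reward = reward_from_logits([3.2, -1.1])
# ...while a hateful one scores near zero.
bad_reward = reward_from_logits([-2.0, 2.5])

print(round(good_reward, 3), round(bad_reward, 3))
```

The probability of the positive class is a convenient scalar reward because it is bounded in [0, 1] and grows as the completion aligns better with the class you want to optimize for.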
Proximal Policy Optimization Algorithm:
In this blog, we’ll dive into the Proximal Policy Optimization (PPO) algorithm to better understand it and see how we can use it in Reinforcement Learning with human feedback. Proximal Policy Optimization (PPO) is a robust algorithm designed for tackling reinforcement learning problems effectively. True to its name, PPO focuses on optimizing a policy, specifically the Language Model (LLM), to better align with human preferences. Through numerous iterative steps, PPO carefully adjusts the LLM, ensuring that these adjustments are incremental and restricted within a predefined range. This approach yields an updated LLM that remains in close proximity to its previous version, hence the name “Proximal Policy Optimization.” By confining changes within this limited range, PPO fosters a more stable and controlled learning process. The ultimate objective is to enhance the policy in such a way that it maximizes the obtained rewards.
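The "restricted within a predefined range" idea is implemented by clipping the probability ratio between the updated and old policies to a small interval around 1. A minimal sketch, with an illustrative (though typical) epsilon value:

```python
EPSILON = 0.2   # defines the trust region [1 - eps, 1 + eps]

def clipped_objective(ratio, advantage):
    """PPO clipped surrogate objective for a single action:
    min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1 - EPSILON, min(1 + EPSILON, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large policy change (ratio 1.5) gains nothing beyond the clip boundary...
print(clipped_objective(1.5, advantage=1.0))   # 1.2
# ...while a modest change passes through unclipped.
print(clipped_objective(1.1, advantage=1.0))   # 1.1
```

Because gains beyond the boundary are flattened, the gradient gives the model no incentive to move far from its previous version in a single update, which is exactly the stabilizing behavior described above.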
Phase 1: Create Completions
In the realm of Reinforcement Learning and PPO, assessing the quality of generated text is crucial. To accomplish this, we rely on a tool known as the “value function.” Think of it as a quality checker for the text we produce. Imagine you’re crafting a story, and at each step, you want to gauge how well the story is coming along.
Here’s how it works: We provide our Language Model (LLM) with prompts or questions, and it generates responses. Subsequently, a reward model evaluates how closely those responses align with our criteria. This is where the value function steps in. Its role is to estimate the quality of the responses as they are generated, essentially evaluating the text’s quality word by word.
The ultimate objective is to minimize what we call “value loss.” This term signifies the difference between the actual future rewards we obtain and the rewards we anticipated. By doing so, we ensure that the text we generate consistently meets our predefined quality standards.
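The value loss described here is just a squared-error measure between predicted and realized returns. A small sketch with illustrative numbers:

```python
def value_loss(predicted, actual):
    """Mean squared error between the value function's predicted
    returns and the returns actually obtained."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

predicted_returns = [0.9, 0.5, 0.2]   # value-function estimates per step
actual_returns    = [1.0, 0.4, 0.3]   # rewards actually observed

print(value_loss(predicted_returns, actual_returns))
```

Minimizing this quantity makes the value function a better quality checker, which in turn improves the advantage estimates used in the next phase.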
Phase 2: Policy Loss
In Phase 2, we implement small updates to the model and assess how these changes impact our goal of aligning the model with human preferences. These updates to the model’s weights are driven by factors like prompt completions, losses, and rewards. Importantly, we ensure that these model updates stay within a limited range, often referred to as the “trust region.”
This adherence to a bounded region is where the “proximal” part of PPO comes into play. Our aim is for these incremental updates to guide the model toward higher rewards. Now, let’s delve into the core of PPO: the policy objective. Remember, our objective is to find a policy that yields high expected rewards. In simpler terms, we want to tweak the LLM’s weights in a way that makes its text completions more in line with human preferences, thus earning higher rewards. The policy loss is the primary focus of the PPO algorithm during training.
While the mathematical expressions might appear complex at first glance, they are more straightforward than they seem. Let’s break it down step by step. First, concentrate on the most crucial part and disregard the rest for now. In this context, π(a_t | s_t) represents the probability of the next token a_t given the current state s_t. Here, the action a_t refers to the next token, and the state s_t signifies the completed prompt up to token t. The denominator is the probability of the next token under the initial, unaltered LLM. The numerator, on the other hand, is the probability of the next token under the updated LLM, whose weights we can modify to achieve better rewards.
Â_t is known as the “estimated advantage term” for a particular choice of action. This term gauges how much better or worse the current action is when compared to all potential actions within the given context. We consider the expected future rewards of a completion following the new token, estimating the advantage of this particular completion relative to the alternatives.
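Putting the ratio and the advantage together gives the per-token surrogate term. The log-probabilities below are invented for illustration; in practice they would be read out of the old and updated versions of the model.

```python
import math

def surrogate(logp_new, logp_old, advantage):
    """Unclipped PPO surrogate term for one token:
    (pi_new / pi_old) * advantage, computed from log-probabilities."""
    ratio = math.exp(logp_new - logp_old)
    return ratio * advantage

# The updated policy makes the chosen token slightly more likely,
# and the advantage says the token was a good choice, so the term is positive.
term = surrogate(logp_new=-1.0, logp_old=-1.2, advantage=0.5)
print(round(term, 4))
```

Working with log-probabilities and exponentiating the difference is the standard numerically stable way to form the ratio, since raw token probabilities can be extremely small.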
Conclusion: In this blog, we’ve delved into the fundamental concepts of Reinforcement Learning from Human Feedback (RLHF), providing a comprehensive overview of this intriguing framework. We explored its key components and principles, using the example of Proximal Policy Optimization (PPO) to illustrate its application. RLHF is a powerful approach that harnesses the capabilities of PPO to fine-tune Language Models (LLMs) in alignment with human preferences.
It unfolds in two distinct phases, with the first involving the collection of invaluable human feedback. This feedback forms the basis for training a reward model, which quantifies how well the LLM’s text completions align with human preferences. The second phase is where the magic of PPO comes into play. Through small and carefully bounded updates, the LLM is refined to optimize its policy. The pivotal policy loss propels the model towards better alignment with human preferences, while entropy ensures that creativity remains intact during training.
Hyperparameters like C1 and C2 fine-tune the balance between these aspects. As we embark on an iterative journey of model updates, the LLM gradually evolves, becoming increasingly human-aligned. The ultimate goal is to generate text that not only matches human preferences but also maintains a level of creativity that keeps the language nuanced and engaging. In conclusion, RLHF, driven by the PPO algorithm, offers a powerful framework to train Language Models that can produce text that resonates more closely with human preferences, striking a harmonious balance between alignment and creativity. This approach paves the way for more sophisticated and nuanced language generation, unlocking a realm of possibilities in the world of artificial intelligence and natural language understanding.
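How C1 and C2 combine the pieces can be sketched as follows. The weights and loss values are illustrative only; real training computes these terms from batches of completions.

```python
C1 = 0.5    # weight on the value loss
C2 = 0.01   # weight on the entropy bonus

def ppo_objective(policy_loss, value_loss, entropy):
    """Combined PPO objective (to be maximized):
    policy term, minus the weighted value loss, plus a weighted
    entropy bonus that keeps generation from collapsing."""
    return policy_loss - C1 * value_loss + C2 * entropy

print(ppo_objective(policy_loss=0.8, value_loss=0.1, entropy=2.0))
```

Raising C2 rewards higher-entropy (more varied) outputs, which is the "creativity" knob mentioned above; raising C1 puts more pressure on accurate return estimates.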