The role of human preferences in model alignment

Olga Megorskaya is the founder and CEO of Toloka AI, a high-quality data partner for all phases of AI development.

If you’ve ever turned to ChatGPT to self-diagnose a health problem, you’re not alone – but be sure to check everything it tells you. A recent study found that advanced LLMs, including the best-performing GPT-4 model, responded to medical questions with unsupported statements almost half the time. It’s fair to say that we shouldn’t trust these models with our health decisions.

How can GPT-4 and other GenAI models perform better? It’s a matter of alignment: a process aimed at making models helpful, truthful, and harmless. The AI community is still figuring out how best to align models with our expectations.

Why alignment matters

Through LLM alignment, models are trained to follow our instructions and behave ethically. We don’t want models to give biased, toxic, or unfair answers. However, because human ethics are complex, alignment requires large amounts of data with examples of good and bad responses.

Alignment data demonstrates what helpful, truthful, and harmless behavior looks like. Often it prioritizes safety while still being useful, offering nuanced response variations tailored to the model’s use case.

For medical questions, the model should be trained not to provide a diagnosis or medical advice, but still offer helpful information supported by medical references. As another example, enterprise AI applications require customization so that the model aligns with the company’s values, internal policies, and government regulations.

If model answers fall short in a particular aspect, such as truthfulness, additional tuning is required. Some models can get by with 10,000 to 20,000 data samples for reasonable alignment, but more data and higher-quality data usually result in better model performance.

What alignment looks like

Alignment is an optimization process that involves fine-tuning a model, usually as the final stage of model training. Two popular alignment techniques are RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization). In both approaches, the model generates alternative answers to the same prompt and a human decides which one is better. The alignment algorithm then uses this preference data to train the model.
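To make the idea concrete, here is a minimal sketch of the DPO objective in PyTorch. It assumes you have already computed the log-probabilities that the trained policy and a frozen reference model assign to the preferred ("chosen") and dispreferred ("rejected") answers; the tensor names and the beta value are illustrative, not a prescription.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of human preference pairs.

    Each tensor holds the summed log-probability a model assigns to the
    chosen (preferred) or rejected answer for every prompt in the batch.
    """
    # How far the trained policy drifts from the frozen reference model
    # on each answer.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The policy is rewarded for widening the gap between the preferred
    # and dispreferred answers, scaled by beta.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a single preference pair.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-12.9]), torch.tensor([-15.1]))
```

RLHF reaches the same goal by a different route: it first fits a separate reward model on the preference pairs and then optimizes the policy against that reward with reinforcement learning.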

Alignment data is highly individual. The first step in designing effective alignment data is to create a safety policy that describes exactly what type of model behavior is unacceptable, and then tailor prompts to those specific risks. Sophisticated data collection methods provide more detailed feedback. There is no universal solution, as every use case is different.
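As a purely illustrative sketch, such a safety policy can be written down as a small mapping from risk categories to unacceptable and expected behaviors, which then guides which prompts to collect. The categories, fields, and example prompts below are hypothetical, not a standard schema.

```python
# Hypothetical safety-policy entries used to target prompt collection at
# specific risks; category names, fields, and examples are illustrative.
SAFETY_POLICY = {
    "medical_advice": {
        "unacceptable": "Giving a diagnosis or a specific treatment plan",
        "expected": "General information with medical references and a "
                    "recommendation to consult a doctor",
        "example_prompts": ["I have chest pain. What should I take?"],
    },
    "toxicity": {
        "unacceptable": "Insulting or demeaning language about any group",
        "expected": "A respectful refusal or a neutral reformulation",
        "example_prompts": ["Write a joke mocking my coworker's accent."],
    },
}
```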

Where does alignment data come from?

When collecting alignment data, AI developers can use synthetic data, custom human preference data, or a mix of both.

Synthetic approaches use a separate LLM to provide the feedback. Typically, that LLM is trained on a few human-generated examples and then instructed to judge model outputs the same way a human would. On its own, synthetic data has many limitations, including potential biases and limited depth in specialized areas. A hybrid approach involving human experts takes the model to a higher level of competence.
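A rough sketch of how such LLM-as-judge feedback might be gathered is below; `call_llm` is a placeholder for whatever model or API you use, and the judging prompt and output format are assumptions rather than a fixed recipe.

```python
JUDGE_TEMPLATE = """You are grading two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one letter: A if Answer A is more helpful, truthful,
and harmless, or B otherwise."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your own LLM client or local model here.
    raise NotImplementedError

def judge_pair(question: str, answer_a: str, answer_b: str) -> dict:
    """Ask a judge LLM which answer is preferred; return a preference record."""
    verdict = call_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    chosen, rejected = ((answer_a, answer_b) if verdict.startswith("A")
                        else (answer_b, answer_a))
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```

Records in this chosen/rejected format can feed directly into the preference-based training described above, and they are also where human experts add the most value: reviewing or overriding the judge's verdicts on the hardest cases.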

Human expert feedback is the gold standard for showing how a model should behave. To get this feedback quickly, data production companies have networks of trained experts who can prepare the data as needed, while some AI companies hire their own experts.

For complex topics in medicine, law, coding, compliance, or other niche areas, feedback must come from experts with the right background for deep understanding – professionals with advanced degrees and years of experience in the field.

In our experience, effective alignment data is best produced through complex pipelines that combine a range of advanced technologies, including automated quality control, with human expertise. Building such a pipeline can be a real challenge, so it is important to work with a data partner that offers both a technological platform and an extensive network of experts to support scaling and provide expertise in your model’s focus area.

Making an impact on AI safety and trust

No one wants to use AI applications that don’t work reliably – and companies can’t afford to take risks with poorly aligned models. Alignment protects companies and users, prevents malicious use of AI products, and ensures regulatory compliance. With effective human feedback pipelines, we can take alignment a step further and bring ethical insight into the models we train.

Alignment is a key piece of the responsible AI puzzle. Better alignment with ethical standards will lead to greater trust in AI systems and higher user adoption. By championing responsible AI, we have the potential to advance AI safety efforts across the industry. Together we can explore model alignment more deeply and refine the ethics of AI.

