59. LLM as a judge: use cases and evaluations.
Let's go inception. LLMs judging other LLMs. How does that work? And how can you evaluate them?
Introduction
In the context of RLHF, we require human annotators to evaluate LLM outputs. But… why not use an LLM to evaluate the outputs instead?
Collecting data from humans is noisy, time-consuming and, most importantly, expensive!
Let’s see how that works in practice, and how you can evaluate the evaluators.
How does that work?
LLM-as-a-judge is a reference-free metric that directly prompts a powerful LLM to evaluate the quality of another model’s output.
Despite its limitations, this technique has been found to agree consistently with human preferences, while also being able to evaluate a wide variety of open-ended tasks in a scalable manner and with minimal implementation changes.
To perform evaluations with LLM-as-a-Judge, all we need to do is write a prompt!
You can broadly decide between:
Pairwise comparison: the judge is presented with a question and two model responses and asked to identify the better response.
Pointwise scoring: the judge is given a single response to a question and asked to assign a score.
Reference-guided scoring: the judge is given a reference solution in addition to the question and response(s) to help with the scoring process.
Having the LLM output a human-readable rationale along with its score is an easy and useful explainability trick. We can use these explanations to gain a deeper understanding of a model’s performance and shortcomings.
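As a concrete illustration, here is a minimal sketch of a pairwise judge prompt that asks for a rationale before the verdict. The wording, the "Verdict:" format and the helper names are just illustrative assumptions, not a standard:

```python
# Minimal sketch of a pairwise, reference-free judge prompt (illustrative wording only).
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the user question below.
First write a short rationale, then end with a final line of exactly
"Verdict: A", "Verdict: B" or "Verdict: tie".

Question: {question}

Response A: {response_a}

Response B: {response_b}"""


def build_pairwise_prompt(question: str, response_a: str, response_b: str) -> str:
    """Fill the judge template with the question and the two candidate responses."""
    return JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )


def parse_verdict(judge_output: str) -> str:
    """Extract 'A', 'B' or 'tie' from the judge's free-text answer; default to 'tie' if unclear."""
    for line in reversed(judge_output.strip().splitlines()):
        if line.lower().startswith("verdict:"):
            verdict = line.split(":", 1)[1].strip().lower()
            if verdict in ("a", "b"):
                return verdict.upper()
            if verdict == "tie":
                return "tie"
    return "tie"
```

The resulting string is what you send to whatever judge model you use; asking for the rationale before the verdict gives you the explainability trick above essentially for free.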
LLMs have well-known weaknesses of their own, and many of them lead to corresponding biases within LLM-as-a-judge evaluations:
Position bias: the judge may favor outputs based upon their position within the prompt (e.g., the first response in a pairwise prompt).
Verbosity bias: the judge may assign better scores to outputs based upon their length (i.e., longer responses receive higher scores).
Self-enhancement bias: the judge tends to favor responses that it generated itself.
Reducing bias comes down to applying the right technique depending on the type:
Randomizing the position of model outputs within the prompt, generating several scores, and averaging them across positions (see the sketch after this list).
Providing few-shot examples to demonstrate the natural distribution of scores and help with calibrating the judge’s internal scoring mechanism.
Providing correct answers to difficult math and reasoning questions within the prompt as a reference for the judge during the evaluation process.
Using several different models as judges (e.g., Claude, Gemini and GPT-4) to lessen the impact of self-enhancement bias.
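For position bias specifically, one variant of the randomization idea in the pairwise setup is to judge both orderings and only trust verdicts that survive the swap. A minimal sketch, assuming a hypothetical judge_fn that wraps a call to your judge model and returns "A", "B" or "tie":

```python
from typing import Callable


def judge_with_swap(question: str, resp_1: str, resp_2: str,
                    judge_fn: Callable[[str, str, str], str]) -> str:
    """Judge both orderings and keep the verdict only if it is consistent across positions."""
    first = judge_fn(question, resp_1, resp_2)    # resp_1 shown in position A
    swapped = judge_fn(question, resp_2, resp_1)  # resp_1 shown in position B
    # Map the swapped verdict back to the original ordering.
    remap = {"A": "B", "B": "A", "tie": "tie"}
    swapped = remap.get(swapped, "tie")
    # If the verdict flips when the positions change, don't trust it: call it a tie.
    return first if first == swapped else "tie"
```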
Choose your own setup, experiment and evaluate it!
Wait… evaluate?
Yeah, that’s right. You need to evaluate it! Let’s see how.
Evaluations
What are we even comparing the LLM against?
Most likely, you will have human annotators as the baseline. Here, you'd aim for the LLM-human correlation to match the human-human correlation. Compared to human annotators, LLM-evaluators can be orders of magnitude faster and cheaper, as well as more reliable.
However, most recent RLHF pipelines actually use a finetuned classifier / reward model to score responses. In this case, the goal is for the LLM-evaluator to achieve similar recall and precision to the finetuned classifier, which is a more challenging baseline. Furthermore, LLM-evaluators are unlikely to match the millisecond-level latency of a small finetuned evaluator, especially if they require Chain-of-Thought (CoT), and they will likely also cost more per inference.
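To make the human baseline concrete, a simple sanity check (with made-up labels) is to compare the LLM-human agreement rate against the human-human agreement rate on the same examples:

```python
def agreement(labels_1, labels_2):
    """Fraction of examples on which two raters give the same label."""
    return sum(a == b for a, b in zip(labels_1, labels_2)) / len(labels_1)


# Toy preference labels ("A"/"B") from two human annotators and from the LLM judge.
human_1 = ["A", "B", "A", "A", "B", "B"]
human_2 = ["A", "B", "B", "A", "B", "B"]
llm_judge = ["A", "B", "A", "B", "B", "B"]

human_human = agreement(human_1, human_2)
llm_human = (agreement(llm_judge, human_1) + agreement(llm_judge, human_2)) / 2
print(f"human-human agreement: {human_human:.2f}, LLM-human agreement: {llm_human:.2f}")
# If the LLM-human number is in the same ballpark as the human-human number,
# the judge disagrees with a human about as often as another human would.
```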
What metrics do we use to evaluate them?
Classification and correlation metrics are typically adopted in the literature and industry.
Classification metrics are more straightforward to apply and interpret. For example, we can evaluate the recall and precision of an LLM-evaluator at the task of detecting factual inconsistency or toxicity in responses.
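For instance, if humans have labelled a set of responses as toxic or not, and the LLM-evaluator flags each response, the classification metrics are a couple of scikit-learn calls away (the labels below are made up):

```python
from sklearn.metrics import precision_score, recall_score

# 1 = "response is toxic" according to human annotators (ground truth)...
human_labels = [1, 0, 0, 1, 1, 0, 0, 1]
# ...and according to the LLM-evaluator.
judge_labels = [1, 0, 1, 1, 0, 0, 0, 1]

# Precision: of the responses the judge flagged, how many are actually toxic?
# Recall: of the truly toxic responses, how many did the judge catch?
print("precision:", precision_score(human_labels, judge_labels))
print("recall:", recall_score(human_labels, judge_labels))
```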
Correlation metrics are a bit more challenging. They don't account for chance agreement and can therefore be overoptimistic. Furthermore, compared to classification metrics, it's less straightforward to translate a correlation score into production performance. (What's the evaluator's recall on bad responses? What about its false positive rate?)
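A toy illustration of the chance-agreement problem, using raw agreement as the simplest chance-uncorrected number and Cohen's kappa as a chance-corrected one (again with made-up labels):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy, heavily skewed labels: 1 = "response is fine", 0 = "response is bad".
human = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
judge = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print("raw agreement:", accuracy_score(human, judge))     # looks decent (0.8)
print("Cohen's kappa:", cohen_kappa_score(human, judge))  # below zero: worse than chance
# Because almost every response is "fine", the two raters agree a lot by chance alone.
# A chance-corrected metric like Cohen's kappa shows the judge adds no signal on the
# bad responses, which is exactly the "recall on bad responses" question above.
```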
I think it’s just better to stick to the classification metrics when possible.
Closing thoughts
Personally, I believe it makes total sense to use LLM evaluators instead of human evaluators. However, as we have just seen in today's article, there is always some kind of tradeoff:
LLM-as-a-judge can be more expensive in some cases (e.g., when CoT is required)
Choosing the right prompt is crucial, and having guardrails in place to mitigate bias is even more so
Evaluate the LLM-as-a-judge approach with… classic metrics! ;)
Are you using LLM-as-a-judge at your current job?
Ludo


