Testing (evaluating model responses against a set of test cases) and monitoring (continuous evaluation in production) are vital parts of developing and operating an LLM system: they confirm that the system is functioning properly and that changes to it actually improve its behaviour.
Because this is such a large and important subject, a wide range of approaches and tools is available. One of them is LLM-as-a-Judge: using an LLM to evaluate the quality of a response generated by another LLM. This has become a popular area of research because human evaluation is expensive and because LLM capability has improved markedly since the advent of GPT-4.
For example, in our RAG use case you might present an LLM evaluator with a text input containing the company's policy on document control and employees' responsibilities under it, followed by the statement: "Employees must follow a clean desk policy and ensure they have no confidential information present and visible on their desk when they are not present there". You would then ask the evaluator whether this statement is supported by the policy article and to explain why it is or is not.
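A minimal sketch of such a check is shown below. It assumes the OpenAI Python client and the gpt-4o model purely for illustration; the prompt wording, JSON output format, and placeholder policy text are assumptions rather than a prescribed implementation.

```python
import json
from openai import OpenAI  # any chat-completion client could be substituted

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are checking whether a statement is supported by a policy document.

Policy document:
{context}

Statement:
{statement}

Reply in JSON with two fields:
  "verdict": "supported" or "not_supported"
  "explanation": a short justification citing the relevant part of the policy
"""

def judge_faithfulness(context: str, statement: str) -> dict:
    """Ask the evaluator model whether the statement is supported by the context."""
    response = client.chat.completions.create(
        model="gpt-4o",   # illustrative choice of evaluator model
        temperature=0,    # deterministic judgements are easier to compare run-to-run
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, statement=statement)}],
    )
    return json.loads(response.choices[0].message.content)

# Example usage with the clean desk statement from the paragraph above
policy = "Document control policy: employees are responsible for ..."  # placeholder text
statement = ("Employees must follow a clean desk policy and ensure they have no "
             "confidential information present and visible on their desk when "
             "they are not present there")
print(judge_faithfulness(policy, statement))
```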
The effectiveness of an evaluator can be measured using either classification or correlation metrics, with correlation metrics generally being harder to apply. Classification metrics compare the evaluator's verdicts against human-labelled ground truth: accuracy measures the proportion of verdicts that are correct; precision measures what proportion of the outputs the evaluator flags as positive (for example, as correct or as hallucinated) really are; recall measures what proportion of all the true positives the evaluator manages to flag. Correlation metrics (such as Pearson, Spearman, or Kendall correlation between the evaluator's scores and human ratings) instead measure how closely the evaluator's scoring tracks human judgement.
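As a concrete illustration, the sketch below computes these metrics with scikit-learn and SciPy, assuming you have already collected the evaluator's verdicts and graded scores alongside human labels for the same test cases; the data shown are placeholders.

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical evaluation run: 1 = "supported", 0 = "not supported"
human_labels   = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_verdicts = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

print("accuracy :", accuracy_score(human_labels, judge_verdicts))
print("precision:", precision_score(human_labels, judge_verdicts))
print("recall   :", recall_score(human_labels, judge_verdicts))

# Correlation metrics compare graded scores rather than binary verdicts,
# e.g. a 1-5 quality rating from the judge versus a human rating of the same outputs.
human_scores = [5, 4, 2, 4, 1, 2, 5, 3, 1, 4]
judge_scores = [4, 4, 2, 3, 2, 3, 5, 3, 1, 5]
rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
```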
A range of research has found that LLMs can be effective evaluators of the outputs of other LLMs. Some work even shows that, while a fine-tuned in-domain model can achieve higher accuracy on specific in-domain tests, general-purpose LLMs can be more "generalised and fair", so depending on the use case a bespoke evaluator may not be worth building. The literature describes a range of effective approaches to this kind of evaluation, and given how recent the field is, better approaches are likely to emerge, while existing ones should improve as state-of-the-art models improve. It is therefore highly recommended to introduce an LLM-based evaluator into your testing procedure, but human oversight remains important to verify its results: as with everything produced by LLMs, there is plenty of scope for error. In practice this means not just taking the evaluation scores at face value, but inspecting the confusion matrix and understanding what the evaluation is telling you. This is a tool for making potential issues easier to find, not something you can set up once and then stop worrying about your system.
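Continuing the placeholder data from the previous sketch, the snippet below shows one way to go beyond a headline score: build the confusion matrix and pull out the cases where the evaluator disagrees with the human labeller so they can be reviewed by hand.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical verdicts: 1 = "supported", 0 = "not supported"
human_labels   = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_verdicts = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
test_cases     = [f"case_{i}" for i in range(len(human_labels))]

# Rows are the human label, columns are the evaluator's verdict
cm = confusion_matrix(human_labels, judge_verdicts, labels=[1, 0])
print("confusion matrix (rows = human, cols = judge, order = [supported, not_supported]):")
print(cm)

# Surface the disagreements for manual review rather than trusting the headline score
disagreements = [case for case, human, judge
                 in zip(test_cases, human_labels, judge_verdicts) if human != judge]
print("cases needing human review:", disagreements)
```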
Potential Tools and Approaches
- Cross Examination
- Zero-Resource Black-Box Hallucination Detection
- Pairwise preference search (see the pairwise-comparison sketch after this list)
- Fairer preference optimisation
- Relevance assessor
- LLMs-as-juries
- Summarisation Evaluation
- NLG Evaluation
- MT-Bench and Chatbot Arena
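Several of the items above (pairwise preference search, MT-Bench and Chatbot Arena) rely on pairwise comparison, where the judge is shown two candidate responses and asked which better answers the prompt. The sketch below illustrates this under the same assumptions as the earlier example (OpenAI client, gpt-4o, illustrative prompt wording); it judges both orderings of the answers to counteract the position bias that pairwise judges are known to exhibit.

```python
import json
from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """You are judging two answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Reply in JSON: {{"winner": "A" or "B" or "tie", "explanation": "..."}}
"""

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B' or 'tie' for a single ordering of the two answers."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative evaluator model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return json.loads(response.choices[0].message.content)["winner"]

def debiased_preference(question: str, first: str, second: str) -> str:
    """Judge both orderings and only accept a winner that survives the position swap."""
    verdict_1 = pairwise_judge(question, first, second)   # 'first' shown as Answer A
    verdict_2 = pairwise_judge(question, second, first)   # 'first' shown as Answer B
    if verdict_1 == "A" and verdict_2 == "B":
        return "first"
    if verdict_1 == "B" and verdict_2 == "A":
        return "second"
    return "tie"  # inconsistent or tied verdicts are treated as a tie
```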
Links
- https://eugeneyan.com/writing/llm-evaluators/
- https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
- https://www.databricks.com/blog/announcing-mlflow-28-llm-judge-metrics-and-best-practices-llm-evaluation-rag-applications-part
- https://medium.com/thedeephub/evaluation-metrics-for-rag-systems-5b8aea3b5478
- https://www.databricks.com/blog/enhancing-llm-as-a-judge-with-grading-notes