Benchmarking the FFM

Sounds like a p0*n title. But I promise you it is not.

So I just ranted with amazement about some of the unexpected frontiers that are being broken, enterprise-wise, by FFMs together with RAG.

Embarking on the journey of adopting such a model for yourself and integrating it into a RAG pattern is ultimately trivial. Strictly speaking from a software engineering perspective.

However, from a data science perspective, you must be able to evaluate the result. How capable is your model at handling your scenario?

Evaluating LLM FFMs is a science in itself, but there are very relevant benchmarks you can use to gauge any LLM. Let’s briefly explore a few (there’s a small evaluation sketch right after the list), before focusing on how you could evaluate your RAG scenario (hint: DROP and HellaSwag).

  • MMLU (Massive Multitask Language Understanding)
    Generally used to identify a model’s blind spots; a broad, cross-domain evaluation. Relevant in zero-shot, few-shot and “medprompt+” configurations.
    Competitive threshold: medprompt+ > 90%
  • GSM8K
    Mathematical problem solving with a training dataset; benchmarks multi-step mathematical reasoning.
    Competitive threshold: zero-shot >95%
  • MATH
    Mathematical problem solving without a training dataset; in exchange, the MATH dataset can be used for training instead of evaluation, or in a 1-shot configuration.
    Competitive threshold: zero-shot ~70%
  • HumanEval
    Used for LLMs trained on code. Kind of the standard here.
    Competitive threshold: zero-shot >95%
  • BIG-bench (Beyond the Imitation Game Benchmark)
    Mining for future capabilities.
    Competitive threshold: few-shot + CoT ~ 90%
  • DROP (Discrete Reasoning Over Paragraphs)
    Currently ~96k questions that require resolving references in a question to multiple positions in the input paragraphs and performing operations (such as addition, counting or sorting) over those portions of the input. This benchmark measures the level of comprehensive understanding. It is split into a training set and a development set, making it ideal for evaluating a RAG capability.
    Competitive threshold: few-shot + CoT ~ 84%
  • HellaSwag
    Evaluation of generative capabilities on NLI problems. The human threshold is accepted at ~95% for this one. TL;DR: if a model scores 95% or more on HellaSwag, its generative capabilities are human-like. This is what you want and nothing less.
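
If you want hands-on numbers for several of these at once, an off-the-shelf harness is the fastest route. Here is a minimal sketch, assuming EleutherAI’s lm-evaluation-harness and an arbitrary Hugging Face model id; task names and the exact Python API shift between harness versions, so treat it as orientation rather than a recipe.

    import lm_eval  # EleutherAI lm-evaluation-harness (pip install lm-eval)

    # Evaluate one Hugging Face model on a few of the benchmarks listed above.
    # The model id and the few-shot setting are illustrative choices.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
        tasks=["hellaswag", "gsm8k", "mmlu", "drop"],
        num_fewshot=5,  # set to 0 for zero-shot numbers
        batch_size=8,
    )

    # One metrics dict per task (accuracy, exact match, F1, ...).
    for task, metrics in results["results"].items():
        print(task, metrics)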

I took the liberty of adding some competitive thresholds, in case you need some orientation in this evolving landscape. Take them with a grain of salt; they are based on my experience and the research that went into this material. Nevertheless, it should be a red flag if you’re running an FFM that benchmarks lower than these.

Back to the problem at hand: your RAG setup can easily be evaluated with a combination of the DROP benchmark and HellaSwag. HellaSwag should be as high as possible (generation quality), while DROP measures how well your model comprehends and reasons over the retrieved paragraphs.
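
To make that concrete, here is a minimal sketch of DROP-style scoring over the answers your RAG pipeline produces. The official DROP metric is more elaborate (it handles numbers, dates and multi-span answers), so this token-level exact match / F1 is only an illustration, and rag_pipeline is a hypothetical stand-in for your own retriever-plus-FFM chain.

    import re
    from collections import Counter

    def normalize(text: str) -> list[str]:
        # Lowercase, strip punctuation, split into tokens.
        return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

    def exact_match(prediction: str, gold: str) -> float:
        return float(normalize(prediction) == normalize(gold))

    def f1(prediction: str, gold: str) -> float:
        # Token-level F1, a simplified stand-in for the official DROP metric.
        pred, ref = normalize(prediction), normalize(gold)
        common = Counter(pred) & Counter(ref)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    def evaluate(rag_pipeline, eval_set):
        # rag_pipeline: hypothetical callable wrapping your retriever + FFM.
        # eval_set: iterable of {"question": ..., "answer": ...} records.
        ems, f1s = [], []
        for example in eval_set:
            prediction = rag_pipeline(example["question"])
            ems.append(exact_match(prediction, example["answer"]))
            f1s.append(f1(prediction, example["answer"]))
        return sum(ems) / len(ems), sum(f1s) / len(f1s)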

You can go the extra mile, take a look at the DROP dataset, replace those paragraphs with paragraphs from your RAG scenario, and then run a benchmarking experiment. A little birdie told me that this is relevant if done correctly.
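
A minimal sketch of what that swap could look like, assuming the Hugging Face datasets copy of DROP (dataset and field names may differ depending on the version you pull); the domain paragraph and QA pair below are purely illustrative placeholders.

    from datasets import load_dataset  # Hugging Face datasets library

    # Peek at the original benchmark to copy its shape (field names may vary
    # slightly depending on the dataset version you pull).
    drop = load_dataset("drop", split="validation")
    print(drop[0]["passage"], drop[0]["question"], drop[0]["answers_spans"])

    # Purely illustrative placeholders for your own corpus and QA pairs.
    my_domain_qa_pairs = [
        ("ACME's Q3 revenue grew from 10M EUR to 14M EUR.",
         "By how much did ACME's Q3 revenue grow?",
         "4M EUR"),
    ]

    # A DROP-style evaluation set: same shape, but the paragraphs,
    # questions and gold answers come from your RAG scenario.
    custom_eval_set = [
        {"passage": passage, "question": question, "answer": answer}
        for passage, question, answer in my_domain_qa_pairs
    ]

Feed custom_eval_set into the scorer from the previous sketch and you have yourself a domain-specific, DROP-style benchmarking experiment.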

However, all the datasets and benchmarking algorithms (already implemented) are available under (various) open licenses. For example, you can find implementations and datasets for ALL the benchmarks I have mentioned above at https://paperswithcode.com/

Happy new year!