Benchmarking the FFM

Sounds like a p0*n title. But I promise you it is not.

So I just ranted with amazement about some of the unexpected frontiers that are being broken, enterprise-wise, by FFMs together with RAG.

Embarking on the journey of adopting such a model for yourself and integrating it into a RAG pattern is ultimately trivial. Strictly speaking from a software engineering perspective.

However, from a data science perspective, you must be able to evaluate the result. How capable is your model at handling your scenario?

Evaluating LLM FFMs is a science in itself, but there are very relevant benchmarks that you can use to gauge any LLM. Let’s briefly explore a few before focusing on how you could evaluate your RAG scenario (hint: DROP and HellaSwag).

  • MMLU (Massive Multitask Language Understanding)
    Generally used to identify a model’s blind spots. General cross-domain evaluation. Relevant in zero-shot, few-shot and “medprompt+” configurations.
    Competitive threshold: medprompt+, > 90%
  • GSM8K
    Mathematical problem solving with a training dataset. Benchmarks multi-step mathematical reasoning.
    Competitive threshold: zero-shot >95%
  • MATH
    Mathematical problem solving without a training dataset. In exchange, the MATH dataset can be used for training instead of evaluation, or in a 1-shot configuration.
    Competitive threshold: zero-shot ~70%
  • HumanEval
    Used for LLMs trained on code. Kind of the standard here.
    Competitive threshold: zero-shot >95%
  • BIG-bench (Beyond the Imitation Game Benchmark)
    Mining for future capabilities.
    Competitive threshold: few-shot + CoT ~ 90%
  • DROP (Discrete Reasoning Over Paragraphs)
    Currently ~96k questions that require resolving references in the question, possibly to multiple positions in the input paragraph, and performing discrete operations (such as counting, addition, or sorting) over those portions of the input. This benchmark measures the level of comprehensive understanding. It is split into a training set and a development set, making it ideal for evaluating a RAG capability.
    Competitive threshold: few-shot + CoT ~ 84%
  • HellaSwag
    Evaluation of generative capabilities on NLI (natural language inference) problems. The human threshold is accepted at 95% for this one. TL;DR: if a model scores 95% or more on HellaSwag, its generative capabilities are human-like. This is what you want and nothing less. (A minimal scoring sketch follows this list.)
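To make the multiple-choice style of a benchmark like HellaSwag concrete, here is a minimal sketch of the scoring loop. It assumes you can ask your model for a log-likelihood of a candidate ending given the context; the `loglikelihood` callable below is a placeholder for whatever your stack exposes, not a real API, and the demo example is invented.

```python
from typing import Callable, List

def hellaswag_accuracy(
    examples: List[dict],
    loglikelihood: Callable[[str, str], float],
) -> float:
    """HellaSwag-style multiple-choice accuracy.

    Each example has a context, several candidate endings, and the index
    of the correct ending. The model "wins" an example if the ending it
    assigns the highest log-likelihood to is the labeled one.
    """
    correct = 0
    for ex in examples:
        scores = [loglikelihood(ex["context"], ending) for ending in ex["endings"]]
        predicted = scores.index(max(scores))
        if predicted == ex["label"]:
            correct += 1
    return correct / len(examples)

if __name__ == "__main__":
    # Toy stand-in for a real model: it simply prefers the longest ending.
    def fake_loglikelihood(context: str, ending: str) -> float:
        return float(len(ending))

    demo = [
        {
            "context": "She put the kettle on and",
            "endings": ["waited for it to boil.", "flew to the moon.",
                        "argued with the toaster.", "ate the kettle."],
            "label": 0,
        },
    ]
    print(f"accuracy: {hellaswag_accuracy(demo, fake_loglikelihood):.2f}")
```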

I took the liberty of adding some competitive thresholds, in case you need some orientation in this evolving landscape. Take them with a grain of salt; they are based on my experience and some research that has gone into this material. Nevertheless, it should be a red flag if you’re running an FFM that benchmarks lower than these.

Back to the problem at hand: your RAG setup can easily be evaluated with a combination of the DROP benchmark and HellaSwag. HellaSwag should be as high as possible (that is your generation quality), while DROP measures how well your model comprehends and reasons over the paragraphs you feed it.
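The usual DROP metrics are exact match and a token-level F1 against the gold answers. Here is a minimal sketch of that scoring side, assuming you already have the model’s generated answers; the text normalization is simplified compared to the official DROP evaluator.

```python
import string
from collections import Counter
from typing import List

def _normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and articles, split into tokens (simplified)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def exact_match(prediction: str, gold: str) -> float:
    return float(_normalize(prediction) == _normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = _normalize(prediction), _normalize(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def drop_style_score(predictions: List[str], golds: List[List[str]]) -> dict:
    """Average EM and F1; each question may have several acceptable gold answers."""
    em = f1 = 0.0
    for pred, answers in zip(predictions, golds):
        em += max(exact_match(pred, a) for a in answers)
        f1 += max(token_f1(pred, a) for a in answers)
    n = len(predictions)
    return {"exact_match": em / n, "f1": f1 / n}

if __name__ == "__main__":
    print(drop_style_score(["3 touchdowns"], [["three touchdowns", "3 touchdowns"]]))
```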

You can go the extra mile: take a look at the DROP dataset, replace its paragraphs with paragraphs from your RAG scenario, and then run a benchmarking experiment. A little birdie told me that this is relevant if done correctly.
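If you go down that road, the mechanics are straightforward: keep the question/answer structure and swap in your own paragraphs. Below is a minimal sketch of what one such record could look like; the field names are illustrative (they mirror the general shape of DROP entries, not the exact official schema), and the passage and questions are invented examples.

```python
import json
from typing import List

def make_eval_record(passage: str, qa_pairs: List[dict]) -> dict:
    """Build one DROP-like evaluation record from a paragraph of YOUR domain.

    Each qa pair is {"question": ..., "answers": [acceptable answer strings]}.
    Field names are illustrative; adapt them to whatever evaluation harness you use.
    """
    return {"passage": passage, "qa_pairs": qa_pairs}

if __name__ == "__main__":
    record = make_eval_record(
        passage=(
            "The backup job runs nightly at 02:00 UTC. Incremental backups are kept "
            "for 14 days, full backups for 90 days."
        ),
        qa_pairs=[
            {"question": "How long are full backups retained?",
             "answers": ["90 days", "ninety days"]},
            {"question": "How many days longer are full backups kept than incrementals?",
             "answers": ["76", "76 days"]},
        ],
    )
    with open("my_rag_eval.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    print("wrote 1 record with", len(record["qa_pairs"]), "questions")
```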

However, all the datasets and benchmarking algorithms (already implemented) are available under (various) open licenses. For example, you can find implementations and the datasets for ALL the benchmarks I have mentioned above at https://paperswithcode.com/

Happy new year!

RAG(e) Against the Machine

Formulating the latest LLM leaps as foundation models has opened a box of infinite possibilities. If you have been living on Earth for the past 24-36 months, this is not news.

The RAG pattern has now made its way into very niche areas.

But first, a little (his)story

Remember the era of chatbots? Then the era of “synthetic chatbots”? You know, the ones that answered the phone when you wanted to solve a problem with your bank / xSP? Those are (or maybe were) just clever expert systems, fronted by capable voice synthesizers. Yes, an expert system is still AI, and the voice synthesizers are nowadays also built with a sort of generative AI model.

You know why they are still around?

Because they make a difference. Dollar-wise.

Context

Foundational LLMs used with RAG quickly found their way into the more technical aspects of human communications. Engineering, that is. IT&C engineering, to be more precise.

For example, operations centers, including SOCs, very quickly adapted to this new reality and implemented RAG out of the box for second and third level support. In a nutshell, this means that when you build a support team (second and third level) for a product, team members DO NOT have to spend time reading any type of written manual. Zero.

Another really cool example is in cybersecurity. You can now have (and you do have) solutions in place that do “assume breach”-level monitoring, and you can query their status using natural language. This is achieved by indexing definitions of cybersecurity concepts together with the output of the smart monitoring tools. This is already pretty cool. But it is not the main subject of this incursion.
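To keep the mechanics concrete, here is a minimal sketch of that indexing-and-querying pattern, not the actual implementation of any such product. Real deployments use embedding models and vector stores; the toy bag-of-words retriever below only illustrates the flow: index concept definitions together with tool output, retrieve the closest chunks, and stuff them into the prompt.

```python
import math
from collections import Counter
from typing import List

def _vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyRagIndex:
    def __init__(self) -> None:
        self.chunks: List[str] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        q = _vector(query)
        ranked = sorted(self.chunks, key=lambda c: _cosine(q, _vector(c)), reverse=True)
        return ranked[:k]

    def build_prompt(self, query: str) -> str:
        # The prompt (not the model) carries the domain knowledge.
        context = "\n".join(f"- {c}" for c in self.retrieve(query))
        return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    index = ToyRagIndex()
    index.add("Assume breach: operate as if an attacker is already inside the network.")
    index.add("Monitoring output 2024-01-02: 3 anomalous logins flagged on host fe-07.")
    index.add("Lateral movement: techniques attackers use to pivot between systems.")
    print(index.build_prompt("were there any anomalous logins flagged today?"))
```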

The intrigue

I got into a discussion with one of my friends the other day. He is close to some content moderation ecosystems.

For a bit of context, content moderation is a business where workers are subject to an EXTREME churn rate. Look it up for yourself. The average will blow your mind.

Training employees (to use the tools) and offering first level support for the software ecosystem serving the content moderators is the second biggest cost driver for this business.

Well, they have eliminated the need to:

  • do technical training for any employee (new, old, whatever)
  • provide first level support at all

Why? Because RAG can.

I’ve got some neat insights on how they’ve done it, but this is another subject.

This is big. This is crossing a barrier. Completely eliminating first level support for an operation where it accounts for a lot of cost is big. Even if their audience is fairly technical, it is still a big achievement.

I can see a future where…

There will be an acceptance criterion for vendors inside an enterprise ecosystem: your documentation must be capable of integrating with their RAG solution, because if not, the operational cost is 50% higher.

Hey, I’ve seen mission-critical workloads that have shit (the stinky kind) documentation.
And not just once.

It’s not the fact that this happens that intrigues me, but the speed at which it is happening. Or maybe I am too old already. It may have something to do with the fact that cloud providers already have the “RAG SDK” out and ready. Well, this is good news after all.

Peace.