Benchmarking the FFM

Sounds like a p0*n title. But I promise you it is not.

So I just ranted with amazement about some of the unexpected frontiers that are being broken, enterprise-wise, by FFMs together with RAG.

Embarking on the journey of adopting such a model and integrating it into a RAG pattern is ultimately trivial. Strictly speaking from a software engineering perspective.

However, from a data science perspective, you must be able to evaluate the result. How capable is your model at performing your scenario?

Evaluating LLM FFMs is a science in itself, but there are very relevant benchmarks that you can use in order to gauge any LLM. Let’s briefly explore a few before focusing on how you could evaluate your RAG scenario (hint: DROP and HellaSwag).

  • MMLU (Massive Multitask Language Understanding)
    Generally used to identify a model’s blind spots. General cross-domain evaluation. Relevant evaluation in zero-shot, few-shot and “medprompt+” configurations (see the sketch after this list).
    Competitive threshold: medprompt+, > 90%
  • GSM8K
    Grade-school math word problems, shipped with a training dataset. Benchmarks multi-step mathematical reasoning.
    Competitive threshold: zero-shot >95%
  • MATH
    Mathematical problem solving without a training dataset; in exchange, the MATH dataset can be used for training instead of evaluation, or in a 1-shot configuration.
    Competitive threshold: zero-shot ~70%
  • HumanEval
    Used for LLMs trained on code. Pretty much the standard here.
    Competitive threshold: zero-shot >95%
  • BIG-bench (Beyond the Imitation Game Benchmark)
    Mining for future capabilities.
    Competitive threshold: few-shot + CoT ~ 90%
  • DROP (Discrete Reasoning Over Paragraphs)
    Currently ~96k questions that require resolving references in a question to multiple positions in the input paragraph, and performing discrete operations (counting, addition, sorting) over those positions. This benchmark measures the level of comprehensive understanding. It is split into a training set and a development set, making it ideal for evaluating a RAG capability.
    Competitive threshold: few-shot + CoT ~ 84%
  • HellaSwag
    Evaluation of generative capabilities on NLI problems. The human threshold is accepted at 95% for this one. TL;DR: if a HellaSwag benchmark scores 95% or more, then the generative capabilities of the model are human-like. This is what you want and nothing less.
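
Since several of the thresholds above are quoted per prompting configuration, here is a minimal sketch of what zero-shot versus few-shot prompting means in practice. The arithmetic question and exemplars are made up for illustration:

    using System;

    // Zero-shot: the model receives only the task, no examples.
    string zeroShot = "Q: What is 17 * 24?\nA:";

    // Few-shot: the prompt is prefixed with a handful of solved exemplars,
    // so the model can infer the expected format and reasoning style.
    string fewShot =
        "Q: What is 3 * 4?\nA: 12\n\n" +
        "Q: What is 9 * 9?\nA: 81\n\n" +
        "Q: What is 17 * 24?\nA:";

    Console.WriteLine(zeroShot);
    Console.WriteLine(fewShot);

Chain-of-thought (CoT) configurations additionally include worked reasoning steps in the exemplars before each final answer.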

I took the liberty of adding some competitive thresholds, in case you need some orientation in this evolving landscape. Take these thresholds with a grain of salt. They are based on my experience and some research that has gone into this material. Nevertheless, it should be a red flag if you’re running an FFM that benchmarks lower than these.

Back to the problem at hand: your RAG setup can easily be evaluated with a combination of the DROP benchmark and HellaSwag. HellaSwag should be as high as possible, and DROP measures how well your model comprehends and reasons over the retrieved passages.

You can go the extra mile, take a look at the DROP dataset, replace those paragraphs with paragraphs from your RAG scenario, and then run a benchmarking experiment. A little birdie told me that this is relevant if done correctly.
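
As a sketch of what that experiment could look like: load the DROP dev split with your own paragraphs swapped in, query your RAG pipeline, and score exact matches. Everything below is hypothetical scaffolding (the file name drop_dev_custom.json, the DropItem record shape and the QueryRagPipeline helper are all placeholders for your own stack):

    using System;
    using System.IO;
    using System.Linq;
    using System.Text.Json;

    // Hypothetical record shape for a DROP-style item.
    record DropItem(string Passage, string Question, string Answer);

    class DropEval
    {
        // Placeholder for your own RAG call (retrieval + generation);
        // this helper does not exist in any library, wire up yours here.
        static string QueryRagPipeline(string passage, string question) =>
            throw new NotImplementedException();

        static void Main()
        {
            // Hypothetical file: the DROP dev set with its paragraphs
            // replaced by paragraphs from your own RAG scenario.
            var items = JsonSerializer.Deserialize<DropItem[]>(
                File.ReadAllText("drop_dev_custom.json"))!;

            int correct = items.Count(i =>
                QueryRagPipeline(i.Passage, i.Question).Trim() == i.Answer.Trim());

            Console.WriteLine($"Exact match: {100.0 * correct / items.Length:F1}%");
        }
    }

Exact match is the simplest DROP-style metric; the official benchmark also uses a softer F1 score over answer spans.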

However, all the datasets and benchmarking algorithms (already implemented) are available under (various) open licenses. For example, you can find implementations and the datasets for ALL the benchmarks I have mentioned above at https://paperswithcode.com/

Happy new year!

RAG(e) Against the Machine

Framing the latest LLM leaps as foundation models has opened a box of infinite possibilities. If you have been living on Earth for the past 24-36 months, this is not news.

The RAG pattern has now made its way into very niche areas.

But first, a little (his)story

Remember the era of the chatbots? Then the era of “synthetic chatbots”? You know, the ones that answered the phone when you wanted to solve a problem with your (bank / xSP)? Those are (or maybe were) just clever expert systems fronted by capable voice synthesizers. Yes, an expert system is still AI, and the voice synthesizers are nowadays also built with a sort of generative AI model.

You know why they are still around?

Because they make a difference. Dollar-wise.

Context

Foundational LLMs used with RAG quickly found their way into the more technical aspects of human communications. Engineering, that is. IT&C engineering, to be more precise.

For example, operation centers, including SOCs, very quickly adapted to this new reality and implemented RAG out of the box for second and third level support. What this means, in a nutshell, is that when you build a support team (second and third level) for a product, team members DO NOT have to spend time reading any type of written manuals. Zero.

Another really cool example is in cybersecurity. You can now have (and you do have) solutions in place that do “assume breach”-level monitoring, and you can query their status using natural language. This is achieved by indexing definitions of cybersecurity concepts together with the output of the smart monitoring tools. This is already pretty cool. But this is not the main subject of this incursion.
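
To ground the pattern itself: a RAG query is essentially “embed the question, retrieve the closest indexed chunks, stuff them into the prompt”. A minimal sketch, where Embed and Complete are hypothetical stand-ins for whatever embedding and completion endpoints you use:

    using System;
    using System.Linq;

    class RagSketch
    {
        // Hypothetical stand-ins for your embedding / completion endpoints.
        static double[] Embed(string text) => throw new NotImplementedException();
        static string Complete(string prompt) => throw new NotImplementedException();

        static double Cosine(double[] a, double[] b)
        {
            double dot = a.Zip(b, (x, y) => x * y).Sum();
            return dot / (Math.Sqrt(a.Sum(x => x * x)) * Math.Sqrt(b.Sum(x => x * x)));
        }

        // chunks: indexed documentation + tool output, pre-embedded.
        static string Answer(string question, (string Text, double[] Vector)[] chunks)
        {
            var q = Embed(question);
            var context = chunks
                .OrderByDescending(c => Cosine(q, c.Vector))
                .Take(3)                              // top-k retrieval
                .Select(c => c.Text);

            return Complete(
                $"Context:\n{string.Join("\n---\n", context)}\n\n" +
                $"Question: {question}\nAnswer:");
        }
    }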

The intrigue

I got into a discussion with one of my friends the other day. He is close to some content moderation ecosystems.

For a bit of context: content moderation is a business where workers are subject to an EXTREME churn rate. Look it up for yourself. The average will blow your mind.

Training employees (to use the tools) and offering first level support for the software ecosystem serving the content moderators is the operation with the second-biggest cost impact for this business.

Well, they have eliminated the need to:

  • do technical training for any employee (new, old, whatever)
  • offer first level support at all

Why? Because RAG can.

I’ve got some neat insights on how they’ve done it, but this is another subject.

This is big. This is crossing a barrier. Completely eliminating first level support for an operation where this accounts for a lot of cost is big. Even if their audience is fairly technical, it is still a big achievement.

I can see a future where…

There will be an acceptance criterion for various vendors inside an enterprise ecosystem: your documentation must be capable of integrating with their RAG solution, because if not, the operational cost is 50% higher.

Hey, I’ve seen mission-critical workloads that have shit (the stinky kind) documentation.
Not once.

It’s not that this happens that intrigues me, but the speed at which this is happening. Or maybe I am too old already. It may have something to do with the fact that cloud providers already have the “RAG SDK” out and ready. Well, this is good news after all.

Peace.

Probabilistic sudoku using Infer.NET #2

Probably hello again! We started covering the probabilistic solution to the sudoku puzzle, and I have promised a series of articles that will guide you through the journey I took to tackle this topic. Ultimately, we’ll also have a little working example.

First things first. In the previous article I promised to show you how probabilities are to be included in the graphical model. In order to tackle this, please consider the following questions:

  • What problems does a probabilistic model solve for the sudoku puzzle?
  • How do you imagine a probabilistic sudoku model?
  • When is it appropriate to use a probabilistic model?

Now, before going on to read the rest of the content, please spend some time formulating your own answers to these questions. Anyway, I will provide some useful answers below, because, as you figured, these are key questions in understanding this topic 😊

(Annoying pause…)

Let’s go:
When would probabilities be appropriate for tackling sudoku? If a sudoku puzzle has one single solution (a well-formed sudoku puzzle), then using probabilities can be useful, but without too much of a benefit. If, however, multiple solutions are possible for a sudoku puzzle, then using probabilities to solve it can bring a lot of benefits. If you want to find out more about the interesting math behind sudoku, feel free to check this out:
http://pi.math.cornell.edu/~mec/Summer2009/Mahmood/More.html

How do you imagine a probabilistic model for the sudoku game? Wouldn’t it be cool to have, given a sudoku puzzle and a certain value (say, 3), a probability distribution over each cell holding 3 as a possible value? Check out the image below:

Now, I used a 4×4 sudoku as an example. It is pretty difficult to come up with a non-well-formed 4×4 sudoku, but I made an effort 🙂

White means 0 probability (easy), a light shade means low (whatever that means) probability, and darker means higher. The same shade means the same probability. If you imagine a non-well-formed 12×12 puzzle, you can see where this probability game comes in handy 🙂

Good. Now, remember, in the last article we established that we are going to use a graphical model (using graph theory) for tackling the probabilistic sudoku challenge. I promised I would show you how probabilities can fit into this model, so here it is:

Each solution/value node in the graph will hold a vector describing the probability distribution over the possible values, e.g. for cell number 2 (r1,c2) the probability vector is (0, 50%, 0, 50%). The indexes of the vector are 1 through N (4, because we have a 4×4 puzzle in this example 🙂). Yes, they are not zero-based 🙂 This is it. That’s all I wanted to tell you for now.
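
To make this concrete, here is a minimal sketch of how such a per-cell probability vector could be expressed with Infer.NET. The cell and values are the ones from the example above; note that Infer.NET’s Discrete distribution is zero-based, so index k stands for the sudoku value k+1:

    using System;
    using Microsoft.ML.Probabilistic.Models;
    using Microsoft.ML.Probabilistic.Distributions;

    class CellSketch
    {
        static void Main()
        {
            // Cell (r1,c2) from the example: values 2 and 4 are equally
            // likely, values 1 and 3 are impossible.
            Variable<int> cell =
                Variable.Discrete(0.0, 0.5, 0.0, 0.5).Named("cell_r1c2");

            var engine = new InferenceEngine();
            Discrete posterior = engine.Infer<Discrete>(cell);
            Console.WriteLine(posterior); // prints the (0 0.5 0 0.5) distribution
        }
    }

Once the constraints are wired into the graph, these per-cell vectors are exactly what inference updates.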

In order to give you content that is as engaging as possible, I will do my best to write the next article so that it involves as little as possible of the rest of the mathematics behind this solution, and goes directly to the Infer.NET implementation. Believe me, formulating an article that omits a hard definition of marginal distribution and belief propagation (sum-product message passing) and goes directly to using these concepts in code will be difficult.

But I trust that we can do it next time! Probably!

Probabilistic sudoku using Infer.NET #1

Model-based machine learning and the probabilistic programming paradigm are coming to the .NET world. If, by any chance, you are unfamiliar with these topics, please feel free to check the links provided above 🙂

Given this happy situation, I figured that it would be nice to build a probabilistic sudoku solver / generator using the model-based machine learning principles and Infer.NET.

I’ve done it! But trust me, it was a trip worth sharing. I have decided to share this journey with everybody in a series of articles / talks / gatherings.

First, does it make sense to have a probabilistic approach to the sudoku puzzle? Well, yes it does. It is an NP-complete problem (proven back in 2003), so it makes sense to optimize the solution as much as possible.

Second, building the probabilistic model for the generic sudoku puzzle turned out to be a significant challenge. This small article is dedicated to just a fraction of the “probabilistic model for the sudoku puzzle” problem, namely the reason for choosing a Probabilistic Graphical Model – a probabilistic model based on graphs. I hope I will be able to write an entire series of articles that will take you through the entire journey I took. Doing this in just one article is impossible and, let me tell you, exhausting to read and understand. So let’s begin.

Imagine a simple 4×4 sudoku puzzle

Fig 1. 4×4 sudoku puzzle

Trying to drill down into the components of a sudoku puzzle, we can come up with the following:

  1. The solution elements. The elements of the solution (the numbers inside). Let’s call such an element Sn. In row-scan order, for the sample above: (1,2,4,3, 4,3,1,2, 2,1,3,4, 3,4,2,1)
  2. The constraints. You know the rules of sudoku, right? Then, constraints are the groups in which solution elements live. For the 4×4 sudoku puzzle, there are 12 of them (enumerated in the sketch a few lines below). Please see Fig 2. below.
Fig 2. Constraints for the sudoku puzzle

The first constraint (C1) is the first row. You get the idea. (C9) is the first subgrid constraint.
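
As a quick illustration (in C#, since this series targets Infer.NET), here is a minimal sketch of how the 12 constraint groups of a 4×4 puzzle can be enumerated. The zero-based row-scan cell indexing is my own convention for illustration:

    using System.Collections.Generic;

    // C1..C4: rows, C5..C8: columns, C9..C12: 2x2 subgrids.
    // Cells are indexed 0..15 in row-scan order.
    var constraints = new List<int[]>();
    for (int r = 0; r < 4; r++)                  // rows
        constraints.Add(new[] { 4 * r, 4 * r + 1, 4 * r + 2, 4 * r + 3 });
    for (int c = 0; c < 4; c++)                  // columns
        constraints.Add(new[] { c, c + 4, c + 8, c + 12 });
    for (int br = 0; br < 4; br += 2)            // subgrids
        for (int bc = 0; bc < 4; bc += 2)
            constraints.Add(new[] { 4 * br + bc,       4 * br + bc + 1,
                                    4 * (br + 1) + bc, 4 * (br + 1) + bc + 1 });

Each array is exactly the set of solution nodes Sn adjacent to one constraint node Cm in the bipartite graph described next.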

Now, the relationship between the constraint and solution nodes (Cm and Sn) can be modeled as a bipartite graph, as follows:

Fig 3. The graphical model

Again, you get the idea. The information held by the bipartite graph is: which constraint applies to which set of solution elements.

That’s enough for now. Trust me! In the next article I will try to show you how probabilities can fit into this model.

If, by any chance, you want to skip forward to the Infer.NET solution, make sure you read and understand the following paper, which I used to create the probabilistic model for the sudoku puzzle and from which I snipped some pictures for this article.

Until the next article in the series, remember: the world is beautiful! Probably.