Decoupling of scales and the plateau

It is a well-known fact that LLMs will reach multiple plateaus during their development. Right about now there should be one. Even Bill Gates is talking about it. Even Yann LeCun (look it up on X). Although, to be honest, I’m not really counting LeCun as an expert here. He’s more like the Mike Tyson of the field (What? Too much? Too early? šŸ™‚)

This time, the reason behind these plateaus is not a lack of technology, as was the case with the previous AI waves (the ’50s, the ’80s – the LeCun era – the 2000s).
This time, the reason is extremely profound and extremely close to humans. It concerns an interesting aspect of epistemology: the decoupling-of-scales issue.
What this means, in a nutshell, is that you cannot use an emergent level as an indicator of an underlying reality without stopping along the road, gathering more data (experiments and observations), and redoing your models entirely.

My argument is that our models, at best, do clever “unhobbling” (a basic feedback loop that somewhat emulates a train of thought by iterating on answers).
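To make that point concrete, here is a minimal sketch of the kind of answer-iterating loop I mean. It is a toy in Python; ask_llm is a hypothetical stand-in for whatever completion API you call, not any particular vendor’s SDK. Notice that the loop only polishes the answer; the underlying model never changes.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around your completion API of choice."""
    raise NotImplementedError


def unhobbled_answer(question: str, rounds: int = 3) -> str:
    # First pass: a plain answer.
    answer = ask_llm(f"Question: {question}\nAnswer step by step.")
    for _ in range(rounds):
        # Feed the model its own answer and ask it to critique and improve it.
        # This emulates a "train of thought" purely by iterating on outputs;
        # the underlying world model stays exactly the same.
        answer = ask_llm(
            f"Question: {question}\n"
            f"Current answer: {answer}\n"
            "Critique the current answer and produce an improved one."
        )
    return answer
```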

We have not yet seen, at all, a real baking-in of a new “underlying reality model”.

I don’t care about AGI, or when or whether we will truly reach general intelligence! I care about the fact that we are very far from drawing the real benefits out of these LLMs. We have not even battled the first plateau yet. And there are many coming.

The Powerhouse

Lots of discussions are happening right now around what people nowadays call #AI. Some sterile, some insightful, some driven by fear, some driven by enthusiasm, some driven by alcohol 😊

I define a ā€œpowerhouseā€ as the technology that allows us to create the most value. In industry, in sales, in research, in general :).

In this light, during the information age, the one we are barely scratching the surface of, there have been multiple powerhouses worth remembering and talking about.

  1. The internet and communication powerhouse, for example, which we have not yet exhausted. Not by far, in terms of potential productivity. This is something we can understand easily; it’s not worth going into details here.
  2. The data-churn powerhouse: the ability to store, search through, transform, and stream data. I would argue that this is also easily understandable. However, let’s stop here for a moment and make a few points:
    • Transforming and searching data, big data, involves something very much resembling intelligence. It is not for nothing that a certain area of data exploration is called business ā€œintelligenceā€. This could well be one of our first serious encounters with #AI. It happened so long ago that most people don’t even bother to call it #AI, although it very much is #AI 😊
    • Big data is the foundation of unsupervised learning models, so let’s not forget about this.
  3. Computer vision capabilities that are somewhat taken for granted: things like OCR or face recognition (not identification).
  4. Then there is a generation of computer vision that really produces value: things like medical-assist software (for echo imaging, CT, MRI, and other types of imaging). This is still #AI, some of it in the form of clever algorithms, some supervised learning, and some unsupervised learning. I think of this as yet another powerhouse.
  5. Then there is despotic computer vision: things like face identification, which can be, and really is, used at scale. We know about its use in China, but let me tell you something about that: it is used just the same here. We’re just more discreet about it. And yes, I see this as yet another powerhouse. I know, too many versions of the computer vision one.
  6. Another interesting powerhouse is the expansion of the same level of capability into other domains: drug repurposing, voice synthesis, clever localization, etc.

All of this is #AI at its best. We basically offload parts of our human intelligence to machines that can scale certain cognitive processes better than we can, and that are better equipped when it comes to sensors.

We now have a new type of ā€œpowerhouseā€; we refer to it as LLMs. Some of the value-prolific applications are becoming apparent right about now. Bear in mind that this is only the beginning. There is a whole new class of problems that can now become part of the information age, many of which are not even known to us yet. This is happening because the link between humans and these artificial creations is now more intimate: it is language itself.

We have, basically, spent the short time we have had in the information age to:

  • teach computers to do calculations for us 😊
  • teach computers to remember
  • teach computers to communicate
  • teach computers to read
  • teach computers to see
  • teach computers to speak

None of these leaps has been abandoned; they all keep accumulating in the set of problems we solve using computers.

LLMs aren’t going anywhere, I promise you that. They are just opening up possibilities. So hearing all kinds of “informed opinions” stating that there is a great gap between expectations and reality with this advent of LLMs, and that this is bad news for the entire industry, is bull$hit.
Real bull$hit, no different from what I expect it to be šŸ˜‰

Cheers!

Benchmarking the FFM

Sounds like a p0*n title. But I promise you it is not.

So, I just ranted with amazement about some of the unexpected frontiers being broken, enterprise-wise, by FFMs together with RAG (retrieval-augmented generation).

Embarking on a journey to adopt such a model for yourself and integrate it into a RAG pattern is ultimately trivial, strictly speaking from a software engineering perspective.
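From that software engineering perspective, here is roughly what ā€œtrivialā€ looks like: a minimal RAG sketch in Python, where embed and generate are hypothetical stand-ins for your embedding model and your FFM endpoint, and the ā€œindexā€ is just an in-memory list rather than a real vector store.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical call to your embedding model."""
    raise NotImplementedError


def generate(prompt: str) -> str:
    """Hypothetical call to your FFM completion endpoint."""
    raise NotImplementedError


def build_index(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    # Pre-compute one embedding per document chunk.
    return [(doc, embed(doc)) for doc in documents]


def rag_answer(question: str, index: list[tuple[str, np.ndarray]], top_k: int = 3) -> str:
    q = embed(question)
    # Rank chunks by cosine similarity and keep the top_k.
    scored = sorted(
        index,
        key=lambda item: float(
            np.dot(q, item[1]) / (np.linalg.norm(q) * np.linalg.norm(item[1]))
        ),
        reverse=True,
    )
    context = "\n\n".join(doc for doc, _ in scored[:top_k])
    # Ground the model's answer in the retrieved context.
    return generate(
        f"Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```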

However, from a data science perspective, you must be able to evaluate the result: how capable is your model at performing your scenario?

Evaluating LLM FFMs is a science in itself, but there are very relevant benchmarks that you can use to gauge any LLM. Let’s briefly explore a few before focusing on how you could evaluate your RAG scenario (hint: the last two in the list).

  • MMLU (Massive Multitask Language Understanding)
    Generally used to identify a model’s blind spots. General cross-domain evaluation. Relevant in zero-shot, few-shot and “medprompt+” configurations.
    Competitive threshold: medprompt+, >90%
  • GSM8K
    Mathematical problem solving with a training dataset; benchmarks multi-step mathematical reasoning.
    Competitive threshold: zero-shot >95%
  • MATH
    Mathematical problem solving without a training dataset. Alternatively, the MATH dataset can be used for training instead of evaluation, or in a one-shot configuration.
    Competitive threshold: zero-shot ~70%
  • HumanEval
    Used for LLMs trained on code; kind of the standard here.
    Competitive threshold: zero-shot >95%
  • BIG-bench (Beyond the Imitation Game Benchmark)
    Mining for future capabilities.
    Competitive threshold: few-shot + CoT ~ 90%
  • DROP (Discrete Reasoning Over Paragraphs)
    Currently ~96k questions that require resolving references in the question to multiple positions in the input paragraph, and performing operations (such as counting, addition, or sorting) over those portions of the input. This benchmark measures the depth of reading comprehension. It is split into a training set and a development set, making it ideal for evaluating a RAG capability.
    Competitive threshold: few-shot + CoT ~ 84%
  • HellaSwag
    Evaluation of generative capabilities on NLI problems. The human threshold is accepted at 95% for this one. TL;DR: if a HellaSwag benchmark scores 95% or more, then the generative capabilities of the model are human-like. This is what you want and nothing less (a minimal scoring sketch follows this list).
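Most of these benchmarks reduce to very simple scoring loops. As an illustration, here is the minimal HellaSwag-style multiple-choice accuracy sketch promised above. It assumes each example is a dict with a context, a list of candidate endings, and a gold label index, and score_continuation is a hypothetical function returning your model’s log-likelihood for a candidate ending; neither is part of any official harness.

```python
def score_continuation(context: str, ending: str) -> float:
    """Hypothetical: return the model's log-likelihood of `ending` given `context`."""
    raise NotImplementedError


def multiple_choice_accuracy(examples: list[dict]) -> float:
    # Assumed example shape: {"ctx": str, "endings": [str, ...], "label": int}
    correct = 0
    for ex in examples:
        scores = [score_continuation(ex["ctx"], ending) for ending in ex["endings"]]
        predicted = scores.index(max(scores))  # pick the most likely ending
        correct += int(predicted == ex["label"])
    return correct / len(examples)
```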

I took the liberty of adding some competitive thresholds, in case you need some orientation in this evolving landscape. Take these thresholds with a grain of salt; they are based on my experience and on some research that went into this material. Nevertheless, it should be a red flag if you’re running an FFM that benchmarks lower than these.

Back to the problem at hand: your RAG setup can easily be evaluated with a combination of the DROP benchmark and HellaSwag. HellaSwag should be as high as possible (that covers generative quality), while DROP measures how well your model comprehends and answers over the paragraphs it is given.

You can go the extra mile: take a look at the DROP dataset, replace its paragraphs with paragraphs from your RAG scenario, and then run a benchmarking experiment, as sketched below. A little birdie told me that this is relevant if done correctly.
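A minimal sketch of that experiment, under two assumptions that are mine rather than DROP’s official tooling: you have already flattened the (modified) dev set into passage/question/answer triples, and rag_generate wraps your own retrieval-plus-model pipeline. The metric below is plain exact match; the official DROP evaluation also reports a softer F1 you may want to add.

```python
import re
from dataclasses import dataclass


@dataclass
class DropExample:
    passage: str   # swap this field for a paragraph from your own RAG corpus
    question: str
    answer: str    # gold answer as a plain string


def rag_generate(passage: str, question: str) -> str:
    """Hypothetical call into your own RAG pipeline."""
    raise NotImplementedError


def normalize(text: str) -> str:
    # Lowercase, strip punctuation and extra whitespace before comparing.
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def exact_match(examples: list[DropExample]) -> float:
    hits = sum(
        normalize(rag_generate(ex.passage, ex.question)) == normalize(ex.answer)
        for ex in examples
    )
    return hits / len(examples)
```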

However, all the datasets and benchmarking algorithms (already implemented) are available under (various) open licenses. For example, you can find implementations and datasets for ALL the benchmarks I have mentioned above at https://paperswithcode.com/

Happy new year!