What

Bang!

I rarely read business books. Extremely rarely. However, the decade-old “The Decline and Fall of Nokia” by (mainly) David J. Cord has something biographical to it. It’s like a detailed painting of a crowd: it adds clarity and context for an outside viewer.

Nevertheless, in my view, it offers a lot of parallels (or rather, counterexamples) to what is happening nowadays with the advent of this generation of AI. Many tech businesses are bound to suffer, and this new tech seems to be adding fuel to the fire.

What tech, you say?

This one:

LLM Ability to complete an expert-level long task

What’s that, you say? That is the ability of LLMs (so not all AI) to complete long tasks. Read it like this: if you present a complicated problem to an expert, how much time does the expert need to solve it? In summary, the latest commercially available LLM now solves 2 hours’ worth of expert-level work in a single go (the 95% C.I. is 1 hour and 10 minutes to 4 hours). In health, for example, the number is much, much higher. This is, without a doubt, the beginning of an exponential. It started with xAI’s Grok 4, and is thoroughly confirmed by OpenAI’s GPT-5.

To put this in perspective: the jump from o1 to o3 is as big as the jump from o3 to GPT-5. In the four months that passed since April…

The blindspot.

This is the reality that everybody has to swallow. In an analogy with what happened to Nokia, there is a blindspot (counterintuitive for some) that is showing up in businesses. It only confirms my predictions (to be fair, not mine to start with, but those of smarter people): the boost in productivity will show up only at the superior level of expertise. The best people in a field will be able to milk these tools better. Provided they pull up the stool and start putting in the effort to work the teats 🙂

This is also the big blindspot. Many organizations seem to think that because high expertise is easier to find, you can mix it with the low expertise of a random team and boost productivity, when reality is exactly the other way around. In all this turmoil we’ll find an increasing number of Nokias. Bigger or smaller, but still, blind.

Let’s hope more and more business leaders understand this aspect before the exponential growth of this capability turns into a 1/T law for the business.

Bang!

The Mind

As part of an effort to infuse technology and related practical abilities into the daily lives and activities of future graduates, a certain university in Romania is piloting a program. I am part of this program, designing two of the future courses and running the pilot with them. This activity got me into a meditative state about the current state of technology (AI and s**t).

My long-lasting statement is that the current state of the new wave of technology is just beginning to show its potential, and the future has way more surprises ahead.

Things change fast, and we have reached a state of profound revolution in knowledge. Last time I gave an argument against the infamous plateau of generative models, or of transformers in general. I weighed in on the problem of “hierarchies of scale”, and tried to frame it in a story.

The current hypothesis

My current argument is that we seem to have managed to tame certain bits and pieces of the way our own mind figures out data. I do not believe that we know exactly what part of our mind we managed to tame, but we tamed it pretty well. It’s like when we domesticated the dog. We didn’t really know what a dog is, but we used it pretty cleverly.

Dive into my mind

Let’s take the example of AlphaFold.

Long story short: there is an active research field in biology and chemistry trying to figure out protein structures. This is extremely important for various fields, like medicine, just to give an easy-to-grasp example.
Researchers have found several hundred structures via a classical exhaustive method (X-ray crystallography).
Then they tried to move it into the realm of computing, with various sub-mediocre results.
Then someone (Prof. David Baker) thought to make a game (called Foldit) where gamers would actually help predict a structure (following some rules, duh). Spectacular results.
Then someone thought to train some DNNs in a reinforcement setup to play the game. Very good results.
And then (2018-2020), the AlphaFold team threw Transformers at it. In a very clever setup.
Finally, this earned them the Nobel Prize in Chemistry, last year. For discovering the structure of 200 million proteins. And more. I leave the details to be researched independently.

This helped the field of medicine spit out a vaccine for malaria, a few drugs targeting antibiotic-resistant bacteria, etc.

That is why

I think what is happening is virtually unimaginable.

Some of my closer friends know that one of my predictions is that medicine will benefit from a new branch, or a new approach, where the clinical model will be challenged. I am not arguing against the clinical model and the clinical epistemology, but bear in mind that medicine will change dramatically over the next few years. And it is not because of AlphaFold only. Things started to converge once technology started to offload knowledge, and once use cases like “drug repurposing” crystallized their way into mainstream medicine.

And that is why I think, once the dust settles on the current crisis in the IT world, the future looks challenging, and way bigger than it was before.

And finally, that is why I think we have reproduced a valuable piece of the Human Mind at scale.

Cheers!

Decoupling of scales and the plateau

It is a well-known fact that LLMs will reach multiple plateaus during their development. Right about now there should be one. Even Bill Gates is talking about it. Even Yann LeCun (look it up on X). Although, to be honest, I’m not really counting LeCun as an expert here. He’s more like a Mike Tyson of the field (What? Too much? Too early? 🙂)

This time, the reason behind these plateaus is not the lack of technology (as was the case with previous AI waves (50s, 80s – the LeCun Era – 2000s)).
This time, the reason is extremely profound and extremely close to humans. It concerns an interesting aspect of epistemology: the decoupling of scales issue.
What this means, in a nutshell, is that you cannot use an emergent level as an indicator for an underlying reality without stopping on the road, gathering more data (experiments and observations), and redoing your models entirely.

My argument is that our models, right now, at best do some clever “unhobbling” (a basic feedback loop that somewhat emulates a train of thought by iterating answers).

We have not yet seen, at all, a real bake-in of a new “underlying reality model”.

I don’t care about AGI, and when or if we will truly reach general intelligence! I care about the fact that we are very far away from drawing the real benefits out of these LLMs. We have not yet even battled the first plateau. And there are many coming.

The Powerhouse

Lots of discussions happening right now in relation to what people call nowadays #AI. Some sterile, some insightful, some driven by fear, some driven by enthusiasm, some driven by alcohol 😊

I am defining the “powerhouse” as being that technology that allows us to create the most value. In industry, in sales, in research, in general :).

In this light, during the information age, the one we are just barely scratching the surface of, there were multiple powerhouses that we can remember and talk about.

The internet and the communication age powerhouse for example, that we have not yet exhausted. Not by far, in terms of potential productivity. This is something that we can understand easily. It’s not worth going into details here.
The data-churn powerhouse. The ability to store, look-through, transform and stream data. I would argue that this is also easily understandable. However, we may stop a bit here and make a few points:
- Transforming data and searching data, big data, involves something closely resembling intelligence. It is not for nothing that we have a certain area of data exploration that is called business “intelligence”. This, though, could be one of the first serious encounters with #AI. It was so long ago that most people don’t even bother to call this #AI, although it is very #AI😊
- Big data is the foundation of unsupervised learning models, so let’s not forget about this.
The computer vision capabilities are somewhat taken for granted. Things like OCR, or face recognition (not identification).
Then there is a generation of computer vision that really produces value, things like medical assisting software (for echo imaging, CT, MRI, and other types of imaging). You know, this is still #AI, some in the form of clever algorithms, some supervised learning, and some unsupervised learning. I think about this as yet another powerhouse.
Then, there is despotic computer vision, things like face identification that can be, and really is, used at scale. We know about the use in China, but let me tell you something about that: it is used the same way here too. We’re just more discreet about it. And yes, I see this as yet another powerhouse. I know. Too many versions of the computer vision one.
Another interesting powerhouse is the expansion of the same level of capabilities into other domains: drug repurposing, voice synthesis, clever localization, etc.

All of this is #AI at its best. We basically off-load parts of our human intelligence to machines that can scale certain cognitive processes better than us, and that are more equipped when it comes to sensors.

We now have a new type of “powerhouse”; we refer to it as LLMs. Some of the most value-generating applications are becoming apparent right about now. Bear in mind that this is only the beginning. There is a whole new class of problems that can now become part of the information age. Many of these problems are not even known to us right now. This is happening because the link between humans and these artificial creations is now more intimate. It is language itself.

So far, we have basically used our short time in the information age to:

teach computers to do calculations for us 😊
teach computers to remember
teach computers to communicate
teach computers to read
teach computers to see
teach computers to speak

None of these leaps were abandoned; they all accumulate in the problems that we solve using computers.

LLMs aren’t going anywhere, I promise you that. They are just opening up possibilities. So, all the “informed opinions” stating that there is a great difference between expectations and reality with this advent of LLMs, and that this is bad news for the entire industry, are bull$hit.
Real bull$hit, no different from the kind I expected 😉

Cheers!

I’m worried. Formally. I am formally worried!

I have the privilege of knowing some very, very smart people who have the experience and the record to show for it. When they design something, they can afford to be a bit loose in their use of formal specification.

Why?
Because usually, when you ask them questions like:
“What did you consider as necessary capabilities for geographic data synchronization in this transaction system?”, they will give you an answer along the lines of: “We accounted for a distributed algorithm with external-clock synchronization, supporting Suzuki-Kasami for MeX. Same exclusion used homogeneously”.

Your goal as an architect who makes decisions on implementing distributed solutions for critical systems is to be able to provide this answer. If your name is not, I don’t know, Leslie Lamport, or if you don’t have an IQ of over 150, you have to get involved in formal specification. “Cloud does not fix stupidity” is one of the famous quotes in this industry. And when talking about critical systems in the cloud, the “stupidity” threshold in terms of IQ is pretty damn high.

What worries me is that I continue to see 99%+ of critical systems being delivered with a design based on the gut feeling of people holding an under-the-threshold-IQ.
And then I get to see them grow and be operated. Holy f**k, that’s a mess. Listen, this is normal; it’s not your IQ that I am blaming, it’s the arrogance. Just find the budget to force yourself to go through the process of doing formal specification. You will be forced to think about the problems that you cannot foresee.
The people that are paying for your design don’t know shit anyway about what it is that you do all day. And you know it. But that’s a whole different discussion.

I know that the time used for formal specification is very valuable, but hey, that’s why your system is critical.

There’s one anonymous quote here that I like: “The money you make being the first one delivering a critical system quickly turns to dust when the critical system fails.” – Chinese Proverb (I kid, of course!)

17.4 on the Richter scale

Well, iOS 17.4 happened.

There’s a lot of hype around the EU-specific changes to the iOS core supporting multiple stores and multiple payment solutions. As interesting as that is, I don’t care too much about it. It has been old news for some years.

It’s another point that is interesting to me:

CarPlay

Let’s review a few features of the new CarPlay:

climate controls piped through carplay
TPMS (tire pressure monitoring)
Charge monitoring and management for EVs
Some vehicle settings
Trip management

How would you rate the driver-facing software in a current-era, digital-dash car (EV, partial EV, or conventional)? On a scale of 1 to 10?

I’d give a maximum of 3. Let me put it this way. If it is a 5 on usability, it’s a 0.5 in features, or the other way around. Ok, maybe 4 in a top-class 90k+ car. But, not more. Be honest. As proud as you are of your car 🙂

Once a tech giant in software starts touching the software in a car, things can change very quickly.

The car industry’s supply chain does not have the know-how to build driver-facing software. This is not my opinion; unfortunately, it is a fact. And listen, it is not their fault. That’s how things are set up.

What software do they know how to build? Mission-critical. That’s all.

Who fills in the gap? Well, lately, the Apple CarPlay release that came with iOS 17.4. A huge foot in the mouth for this exact industry.

I can only imagine VAG cutting in half all the future feature plans for the digital dash and driver-facing software, and giving it all to Apple and Google. For free.

If things go on this way, and they seem to, the next crisis in the software developer job market is going to be fueled, at least in part, by the drastic cuts coming from the automotive industry’s useless branch of driver-facing software. Yes, hate me for the speech.

I never saw the infamous “goodbye screen” of the new CarPlay. If this feature is still there, I see it as a nice subliminal easter egg. It’s addressed to the next dying industry.

There will be other battles for influence, but still, the ground is shifting: 17.4 on the Richter scale.

F**k!

Benchmarking the FFM

Sounds like a p0*n title. But I promise you it is not.

So I just ranted with amazement about some of the unexpected frontiers that are being broken, enterprise-wise, by FFMs together with RAG.

Embarking on a journey to adopt such a model for yourself, and integrating it in a RAG pattern, is ultimately trivial. Strictly speaking from a software engineering perspective.

However, from a data science perspective, you must be able to evaluate the result. How capable is your model of performing your scenario?

Evaluating LLM FFMs is a science in itself, but there are very relevant benchmarks that you can use in order to gauge any LLM. Let’s briefly explore a few, before focusing on how you could evaluate your RAG scenario (hint: DROP and HellaSwag).

MMLU (Massive Multitask Language Understanding)
Generally used to identify a model’s blind spots. General cross-domain evaluation. Relevant evaluation in zero-shot, few-shot and “medprompt+” configs.
Competitive threshold: medprompt+, > 90%
GSM8K
Mathematical problem solving with a training dataset. Multi-step mathematical reasoning benchmarking.
Competitive threshold: zero-shot >95%
MATH
Mathematical problem solving without a training dataset. In exchange, the MATH dataset can be used for training instead of evaluation. Or in a 1-shot configuration.
Competitive threshold: zero-shot ~70%
HumanEval
Used for LLMs trained on code. Kind of the standard here.
Competitive threshold: zero-shot >95%
BIG-bench (Beyond the Imitation Game Benchmark)
Mining for future capabilities.
Competitive threshold: few-shot + CoT ~ 90%
DROP (Discrete Reasoning Over Paragraphs)
Currently 96k questions that require resolving references in the question to multiple positions in the input, with various discrete operations over portions of the input. This benchmark measures the level of comprehensive understanding. It is split into a training set and a development set, making it ideal for evaluating a RAG capability.
Competitive threshold: few-shot + CoT ~ 84%
HellaSwag
Evaluation of generative capabilities for NLI problems. The human threshold is accepted at 95% for this one. TL;DR: if a model scores 95% or more on HellaSwag, then its generative capabilities are human-like. This is what you want and nothing less.

I took the liberty to add some competitive thresholds, in case you need some orientation in this evolving landscape. Take these thresholds with a grain of salt. They are based on my experience and some research that has gone into this material. Nevertheless, there should be a red flag if you’re running an FFM benchmarked lower than these.
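If you want to turn that red-flag rule into something you can run against a model card, here is a minimal sketch. It only encodes the thresholds quoted in the list above; the function name `red_flags` and the example scores are made up for illustration, and the comparison only makes sense if the reported score uses the same configuration (zero-shot, few-shot + CoT, etc.) as the threshold.

```python
# Competitive thresholds quoted in the list above, in percent.
# HellaSwag uses the 95% human-level mark mentioned in the text.
COMPETITIVE_THRESHOLDS = {
    "MMLU": 90.0,
    "GSM8K": 95.0,
    "MATH": 70.0,
    "HumanEval": 95.0,
    "BIG-bench": 90.0,
    "DROP": 84.0,
    "HellaSwag": 95.0,
}


def red_flags(reported_scores, thresholds=COMPETITIVE_THRESHOLDS):
    """Return {benchmark: (score, threshold)} for every score below its threshold.

    Benchmarks missing from `reported_scores` are skipped, not flagged.
    """
    flags = {}
    for name, threshold in thresholds.items():
        score = reported_scores.get(name)
        if score is not None and score < threshold:
            flags[name] = (score, threshold)
    return flags


if __name__ == "__main__":
    # Hypothetical model-card numbers, invented for this example only.
    example_scores = {"MMLU": 86.4, "GSM8K": 96.1, "DROP": 80.9, "HellaSwag": 95.3}
    for name, (score, threshold) in red_flags(example_scores).items():
        print(f"{name}: {score:.1f}% is below the ~{threshold:.0f}% mark")
```

A flag here is a prompt to look closer, not a verdict: the thresholds themselves are the grain-of-salt numbers from above.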

Back to the problem at hand: your RAG setup can easily be evaluated with a combination of the DROP benchmark and HellaSwag. HellaSwag should be as high as possible, and your DROP is able to measure how well your model can generate.

You can go the extra mile and take a look at the DROP dataset, replace those paragraphs with paragraphs from your RAG scenario, and then run a benchmarking experiment. A little birdie told me that this is relevant if done correctly.

However, all the datasets and benchmarking algorithms (already implemented) are available with (various) open licenses. For example, you can find implementations and the datasets for ALL the benchmarks I have mentioned above at https://paperswithcode.com/
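To make the “swap in your own paragraphs” idea a bit more concrete, here is a rough sketch of what such an experiment could look like. It is not the official DROP evaluator: the scoring below is a simplified, SQuAD-style exact match and token F1 (the real DROP metric treats numbers and multi-span answers more carefully), and `answer_fn` is a hypothetical stand-in for whatever calls your model with a passage and a question.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples, answer_fn):
    """Score answer_fn over examples shaped like
    {"passage": str, "question": str, "answers": [accepted gold strings]}.
    Returns EM and F1 in percent, taking the best match per example."""
    if not examples:
        raise ValueError("need at least one example")
    em_total = f1_total = 0.0
    for ex in examples:
        prediction = answer_fn(ex["passage"], ex["question"])
        em_total += max(exact_match(prediction, g) for g in ex["answers"])
        f1_total += max(token_f1(prediction, g) for g in ex["answers"])
    n = len(examples)
    return {"em": 100 * em_total / n, "f1": 100 * f1_total / n}


if __name__ == "__main__":
    # Toy examples; in practice the passages come from your own RAG corpus.
    toy = [
        {"passage": "The cluster has 3 nodes and 2 replicas.",
         "question": "How many nodes?", "answers": ["3", "3 nodes"]},
    ]
    print(evaluate(toy, lambda passage, question: "3 nodes"))
```

The “if done correctly” part is mostly about the gold answers: questions written against your own paragraphs need answers that were checked by someone who knows the domain.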

Happy new year!

RAG(e) Against the Machine

Formulating the latest LLM leaps as foundation models has opened a box of infinite possibilities. If you have been living on Earth for the past 24-36 months, this is not news.

The RAG pattern has now made its way into very niche areas.

But first, a little (his)story

Remember the era of the chatbots? Then the era of “synthetic chatbots”. You know, the ones that answered the phone when you wanted to solve a problem with your (bank / xSP)? Those are (or maybe were) just clever expert systems, covered by capable voice synthesizers. Yes, an expert system is still AI, the voice synthesizers are nowadays also built with a sort of generative AI model.

You know why they are still around?

Because they make a difference. Dollar-wise.

Context

Foundational LLMs used with RAG quickly found their way into the more technical aspects of human communications. Engineering, that is. IT&C engineering, to be more precise.

For example, operation centers, including SOCs, very quickly adapted to this new reality and implemented RAG out of the box for second and third level support. Basically, what this means, in a nutshell, is that when you build a support team (second and third level) for a product, team members DO NOT have to spend time reading any type of written manuals. Zero.

Another really cool example is in cybersecurity. You can now have (and you do have) solutions in place that do “assume breach”-level monitoring, and you can query their status using natural language. This is achieved by indexing definitions of cybersecurity concepts together with the output of the smart monitoring tools. This is already pretty cool. But this is not the main subject of this incursion.
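A toy sketch of that indexing idea, just to show the shape of it: glossary definitions and monitoring output go into the same index, the question retrieves the closest entries, and those become the context of the prompt. Everything here is an assumption made for illustration (the `TinyIndex` class, the bag-of-words cosine ranking, the host names and alert text); a real setup would use an embedding model and a vector store, and would then send the prompt to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TinyIndex:
    """In-memory index holding (source, text, term counts) entries."""

    def __init__(self):
        self.docs = []

    def add(self, source, text):
        self.docs.append((source, text, Counter(tokenize(text))))

    def search(self, query, k=2):
        q = Counter(tokenize(query))
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[2]), reverse=True)
        return [(source, text) for source, text, _ in ranked[:k]]


def build_prompt(index, question, k=2):
    """Assemble the retrieved entries and the question into one prompt string."""
    context = "\n".join(f"[{src}] {text}" for src, text in index.search(question, k))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


if __name__ == "__main__":
    index = TinyIndex()
    # Hypothetical glossary entries and monitoring output.
    index.add("glossary", "Lateral movement: an attacker pivoting from one "
                          "compromised host to other hosts in the network.")
    index.add("glossary", "Phishing: fraudulent messages that trick users "
                          "into revealing credentials.")
    index.add("monitor", "alert: host db-02 received unusual SMB logins from "
                         "host web-01, possible lateral movement")
    print(build_prompt(index, "Is there any lateral movement on db-02?"))
```

The point of mixing the two sources is that the model gets both the live signal and the definition it needs to interpret it, in the same context window.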

The intrigue

I got into a discussion with one of my friends the other day. He is close to some content moderation ecosystems.

For a bit of context, content moderation is a business where workers are subject to an EXTREME churn rate. Look it up for yourself. The average will blow your mind.

Training employees (to use the tools) and offering first-level support for the software ecosystem serving the content moderators is the operation with the second-largest cost impact for this business.

Well, they have eliminated the need to:

do technical training for any employee (new, old, whatever)
offer first-level support (there is none anymore)

Why? Because RAG can.

I’ve got some neat insights on how they’ve done it, but this is another subject.

This is big. This is crossing a barrier. Completely eliminating first-level support for an operation where this accounts for a lot of cost is big. Even if their audience is fairly technical, it is still a big achievement.

I can see a future where…

There will be an acceptance criterion for the various vendors inside an enterprise ecosystem: your documentation must be capable of integrating with their RAG solution, because if not, the operational cost is 50% higher.

Hey, I’ve seen mission-critical workloads that have shit (the stinky kind) documentation.
More than once.

It’s not that this happens that intrigues me, but the speed at which this is happening. Or maybe I am too old already. It may have something to do with the fact that cloud providers already have the “RAG SDK” out and ready. Well, this is good news after all.

Peace.

The sharp tool

Cloud is long gone my friends, long gone. Cloud computing is now just a tool for a ubiquitous computing society.

Allow me to clarify: simply, there’s absolutely no aspect of our contemporary society that is possible without constant, continuous, invasive and maybe pervasive involvement of computing. The so-called “pervasive computing” (Eva Nieuwdorp). I’m not going to discuss anything about the so-called pervasive dimension of ubiquitous computing, but I am going to leave the concept printed here. It fits well with one of the points I am going to make here.

Ubiquitous computing is the reality – so, not only the concept – where computing appears anytime and everywhere, anywhere. Let’s let this sink in for a moment.

There are many technical moving factors that make ubiquitous computing possible, for example:

Hardware miniaturization
Hardware affordability
Software affordability
UI and UX leaps
Communication infrastructure affordability
(so many more)

If we are to go into psychological moving factors, we would totally open-up a whole different world, so, we are not going to do that.

The reality is that we have been living in a compute-omnipresent society for a long time. There is nothing really new about all of this.

The TNT that actually put the true “ubiquitous” in ubiquitous computing was the dawn of cloud computing. The ability to have computing as a service, any computing. The ability to “order random computing”. The ability to play LEGO with computing units on a global playground. Once this ability was available to the masses, everything changed. Actually, the plain analogy with LEGO is not 100% right. It is like you always had the LEGO pieces, and now, by some magic, you can order a magic part, place it inside your LEGO structure, and it suddenly turns into reality. Magic and dangerous at the same time. Like black magic dust.

At the beginning…

There was, and maybe there still is, a lot of hype around cloud computing. You know, being able to rent and integrate the best of the best of a finished product in terms of computing is always going to be hyped-up.

At the beginning there were a lot of arguments about the “as-a-service” versus “ownership and independence” of computing “stuff”.

At the beginning there were a lot of arguments around the reliability of “the cloud”

At the beginning there were a lot of arguments around the security of it all.

At the beginning there were a lot of arguments about the cloudonomics of it all. Economies of scale, they said.

And now…

Now, the ownership problem is resolved. “People” asked, and the “people” have obtained the possibility of getting everything in and out of the “public cyberspace” as they please. You can run an app in the cloud today, and in your own datacenter tomorrow, together with its data, traffic, whatever.

Now, the reliability got to, I think, almost the best you can have. Proven in war.

Now, the security can no longer take place outside the cloud.

The twist

The last standing post is the cloudonomics. The turn here is that businesses moving to the cloud understood that the cost factor has to be balanced by the possibilities that you now have, post adoption. Optimizing the cost has to be balanced with getting more value for your buck.

If your business doesn’t find a way to benefit from the fact that it can now be globally available, you have failed a bit.

If your business doesn’t find a way to benefit from the fact that it can now fairly easily make use of outstanding technology (AI/ML, BI, data archiving, data handling, etc.), you have failed a bit more.

If your business doesn’t find a way to benefit from the fact that it can now stay available and grow in an elastic manner, you have failed a bit more.

We could go on forever.

It is about how “your business in the cloud” should incorporate computing in general, and carry some weight in the development of ubiquitous computing. That’s the sweet spot. Your cloud adoption is 100% successful when you get more ubiquitous.

The sharp tool

Cloud computing is to ubiquitous computing what a hammer is to a nail. This is not a hardware vs. software analogy, this is not a hardware vs. hardware analogy, this is not an infrastructure vs. service analogy. It is just a tool analogy.

If you know you are going to be hit by a hammer, you’d better be sharp at the other end.

P.S. building sharp tools is going to be the subject of the next topics.

Cheers.

Ranting about surveillance

In Romania, a lot of legislation is either in debate or being passed, giving the authorities direct abilities to get all e-data: e-mails and IMs, with vendors mandated to hand over unencrypted data. Please read this again.

Now, I want to lightly share some of my experience in working for some top cybersecurity ‘consulting’ companies. (lightly, because of NDAs)

Now, the layout is as follows: justice, law enforcement, and ultimately governments are mandating surveillance when justified. Alright. Good.
The surveillance is happening anyway (phone taps, physical tracking, e-mails, IMs, and whatnot) with help from various agencies that are specialized in doing that.
The level of expertise that various govt. agencies have in terms of electronic surveillance is not always up-to-date. This is to say that their capabilities are limited. This is normal. When they face a situation where they can’t pursue a surveillance task, they outsource. There are cybersecurity companies that offer such services. These companies have cybersecurity researchers who stay on top of various 0-days and the corresponding exploits, and they master this.

How do I know this? I was one of these guys that offered cybersecurity research services for such a company. Repeatedly. Actually only two times, for two different companies. So not a lot of experience here. Just enough. I’m not going back there!

So, what’s my problem?

Let me guide you through an example:

Assume there’s a mandate that asks for the IMs sent by the suspect. This is currently achieved, usually, by compromising the user’s device (phone, laptop, PC, Mac, whatthef**kever) with some malware that is usually designed by one of these contracted companies. Surveillance happens ON THE COMPROMISED DEVICE ITSELF.

Surveillance does not happen on the ‘encrypted wire’, or on the IM vendor’s infrastructure, but on the TARGETED DEVICE.

Now, suppose this new law passes, and mandates the IM service providers to hold unencrypted data (or hold encryption keys) FOR EVERYBODY, ‘just in case’ a mandate comes along.

Do you see the problem yet?

Jesus Christ, we live in a f***ed up world!!!

Mid-workshop surprise!

Something weird and nice happened to me today.

Several times a year, I accept requests for guiding cybersecurity workshops for various clients. Usually they fall into the category of web application security, or software development security. Not more than several (max 5) times a year, because more would greatly impact my performance in other areas.

So one client requested a web application security workshop focused on OWASP guidelines. This is awesome for me. I always like OWASP’s content. Sometimes, I even have the privilege of contributing to it. When I provide this service, I never prepare exhaustive slides for presenting an already well-established material, such as the one from OWASP. I just go on the website and work with that as a prelude to my deep examples.

So what happened? Mid-workshop, the OWASP Top 10 W.A.S.R. changed. Bam! “Surprise M**********R!” Deal with that!

Now, during these events, I usually bring a lot of my experience in addition to whatever support material we are using. Actually, this is why someone would require guidance in going through a well-established and very well built security material, such as the one from OWASP.
When I talk and debate, and learn together with an audience about cybersecurity topics, I always emphasize things that I consider to be insufficiently emphasized by the supporting material. I say emphasized and not detailed, and please be careful to consider this difference.

Insufficiently emphasized topics

Traditionally, OWASP’s guidelines and material did not emphasize enough, in my humble opinion:

The importance of using correct cryptographic controls in the areas of: authentication and session management, sensitive data exposure, insufficient authorization
Insecure design in the areas of: bad security configuration, injection problems, insecure deserialization
Data integrity problems. Loop to #1.

I usually spend around 10-11 hours of a 16-hour (or longer) workshop on the three topics above. Very important stuff and, traditionally, overlooked in most teams that I interact with.
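To give a flavor of what “correct cryptographic controls” and “data integrity” mean in practice, here is a minimal sketch using only Python’s standard library. It is not workshop material and not a complete design: the helper names and the scrypt cost parameters are illustrative assumptions, and a real application would tune the parameters, store the salt with the hash, and manage keys properly.

```python
import hashlib
import hmac
import secrets

# Illustrative scrypt cost parameters; tune them for your own hardware.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1}


def hash_password(password):
    """Derive a password hash with a fresh random salt (never store the password)."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, dklen=32,
                            **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password, salt, expected):
    """Recompute the hash and compare in constant time."""
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, dklen=32,
                               **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, expected)


def new_session_id():
    """Session identifiers come from a CSPRNG, not from counters or timestamps."""
    return secrets.token_urlsafe(32)


def sign(payload, key):
    """Integrity tag for data that leaves the server (e.g. a cookie value)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_signature(payload, key, tag):
    return hmac.compare_digest(sign(payload, key), tag)


if __name__ == "__main__":
    salt, digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, digest))
    key = secrets.token_bytes(32)
    tag = sign(b"cart=3", key)
    print(verify_signature(b"cart=4", key, tag))  # tampered payload is rejected
```

Small as it is, it touches the first and third topics at once: how credentials and sessions are protected, and how you notice when data has been changed behind your back.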

What changed?

It was a nice surprise to see that in the new TOP 10 W.A.S.R. OWASP included my three pillars and emphasized concepts the same way I like to do it. They even renamed sections according to my preference. Like the second position (A2) is now called Cryptographic Failures. AWESOME!
They explain stuff in a more holistic manner, as opposed to just enumerating isolated vulnerabilities. AWESOME!
Finally. It was an extremely good argument for the team that I was leading, about the way I spent my time on the three topics. I felt good about them 🙂
Alraaaaight, I felt good about myself too!

Oh, and P.S.: For the first time in... what now, more than a decade (?!), OWASP’s Top 10 W.A.S.R. does not have the top position occupied by injection problems. Either the web has grown exponentially again, or we have escaped a boundary. The boundary of absolute stupidity 🙂

Cheers!