Francesco Gadaleta, PhD, is a seasoned professional in the fields of technology, AI and data science. He’s the founder of Amethix Technologies, a firm specialising in advanced data and robotics solutions. He hosts the popular Data Science at Home podcast, and over his illustrious career he’s held key roles in the healthcare, energy, and finance domains.
Francesco’s professional interests are diverse, spanning applied mathematics, advanced machine learning, computer programming, robotics, and the study of decentralised and distributed systems.
In this post, Francesco examines a recent claim made by software engineer François Chollet. François argues that LLMs are basically databases of programs. On the surface this claim might sound outlandish, but could he be right? Francesco investigates:
In a recent article, software engineer François Chollet (the mind behind the Python library Keras) put forward a bold interpretation of large language models (LLMs): he claimed that modern LLMs act like databases of programs. On the surface, Chollet’s interpretation might sound strange, but I believe his analogy is an astute one in many respects.
In this article, I’ll explore whether LLMs really are a new breed of database – and take a deep dive into the intricate structure of LLMs, revealing how these powerful algorithms build on concepts from the past.
Before we explore Chollet’s analogy, let’s consider his take on the development of LLMs’ core methodology.
Chollet sees this development as stemming from a key advancement in the field of natural language processing (NLP) made a decade ago by Tomáš Mikolov and the folks at Google. Mikolov introduced the Word2vec algorithm, which solved the problem of representing text numerically. Word2vec works by translating words (and, in later extensions, phrases and entire documents) into vectors called ‘word vectors’, and then operating on those vectors.
Today, of course, this is a familiar concept, but back in 2013 it was very innovative (though I should point out that embeddings were already well known to academics and researchers, and had appeared in many papers prior to 2013).
Back in 2013, these word vectors already supported arithmetic operations. In the classic example, taking the embedding of ‘king’, subtracting the embedding of ‘man’ and adding the embedding of ‘woman’ gives you a vector whose nearest neighbour is the embedding of ‘queen’ – the semantically closest concept to the result of that arithmetic. Thanks to these word embeddings, we could perform arithmetic on words.
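To make this concrete, here’s a minimal sketch of that arithmetic using the gensim library and the pretrained Google News Word2vec vectors (this assumes you’re happy to let gensim download the model on first use, and the exact similarity score will depend on the vectors you load):

```python
# A minimal sketch of word-vector arithmetic with gensim.
# Assumes the pretrained Google News Word2vec vectors, which
# gensim downloads on first use (a large download).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
result = vectors.most_similar(positive=["king", "woman"],
                              negative=["man"],
                              topn=1)
print(result)  # expected: [('queen', ...)], score depends on the vectors
```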
Fast-forward 10 years, and what’s changed? These algorithms have evolved into LLMs. ChatGPT arrived in late 2022, GPT-4 followed in March 2023, and we started using ChatGPT and the many other LLMs out there. Billion-parameter models became the new normal.
The original 2013 Word2vec was, of course, incapable of generating fluent language; back then, methodologies were far more limited. But at the core of our modern LLMs, there’s still a version of that Word2vec concept: embedding tokens – words, sub-words, or even entire documents – in a vector space.
In his article, Chollet explores this connection between LLMs and Word2vec. He points out that LLMs are autoregressive models, trained to predict the next word conditioned on the previous word sequence. And if the previous sequence is large enough – if there is a lot of context, or a more complete context – the prediction of the next word becomes increasingly accurate.
But in fact, Word2vec and LLMs both rely on the same fundamental principle: learning a vector space that allows them to predict the next token (or the next word), given a condition; given a context.
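As a toy illustration of that shared principle (a sketch with made-up numbers, not either model’s actual training code), predicting the next token boils down to scoring every word in the vocabulary against a vector that summarises the context, then picking the most likely one:

```python
# Toy sketch: predict the next token from a context vector.
# The vectors are random stand-ins; real models learn them during training.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "train", "left", "station", "on", "time"]
embedding_dim = 8

# One vector per vocabulary word.
output_embeddings = rng.normal(size=(len(vocab), embedding_dim))

# A vector summarising the context "the train left the ..."
context_vector = rng.normal(size=embedding_dim)

# Score every word against the context and turn scores into probabilities.
logits = output_embeddings @ context_vector
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_word = vocab[int(np.argmax(probs))]
print(next_word, probs.round(3))
```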
The common strength of these two rather different methods is that tokens that appear together in the training text also end up close together in the embedding space. That’s the most important consequence of training these models. In fact, it’s crucial: if we lost that correspondence, we wouldn’t be able to match numerical vectors to similar semantic concepts.
Even the dimensionality of the embedding space is quite similar: on the order of 10³ or 10⁴, magic numbers obtained by trial and error. As LLMs increased in scale, that dimension didn’t grow much, because people noticed that increasing the dimensionality of the embedding space brought few improvements in the accuracy of the next predicted token or word, while also requiring more compute power to calculate everything.
So LLMs encode correlated tokens in nearby locations: there’s still a strong connection between training an LLM and training Word2vec.
The outcome of all this, as Chollet says, is self-attention. Chollet sees self-attention as the single most important component in the transformer architecture. He defines it as: ‘…a mechanism for learning a new token embedding space by linearly recombining token embeddings from some prior space in weighted combinations which give greater importance to tokens that are already “closer” to each other,’ which I think is a great summary of how self-attention works.
A classic example is having the input sequence: ‘The train left the station on time.’
First, the self-attention mechanism converts each word into its token vector. So we have the vector of ‘the’, the vector of ‘train’, the vector of ‘left’, and so on. We then stack those vectors into matrices that we use to calculate the attention scores, because we want to calculate the attention score of each word against all the others.
Next (of course I’m oversimplifying this stage of the process), we compute the scores for the word ‘station’, and we immediately find that ‘station’ is usually close to the words ‘train’ and ‘left’. In fact, ‘station’ and ‘train’ are almost always close to each other (at least according to the enormous amount of text used for learning such relations).
So, when we do this exercise and look at the scores for ‘station’, we immediately see that it has a higher score with respect to ‘train’ than to all the other words.
If we extract these scores from the sequence and rank the words by attention score, we find ‘train’, ‘left’ and ‘station’ at the top, which gives us a hint about what’s going on in this sequence: a train is indeed leaving a station.
Of course, we don’t know any further details, because we have just one sequence and one phrase. But imagine you had several gigabytes or even terabytes of text in which all these correspondences between words and contexts were scored by the self-attention mechanism. Then you would have a very interesting situation, in which vectors summarise very nicely whatever is going on in those sequences.
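To ground this walk-through, here’s a minimal numpy sketch of scaled dot-product self-attention over toy embeddings of that sentence. The embeddings and the query/key/value projection matrices are random stand-ins for weights a real transformer would learn, so the printed scores are illustrative only – with trained weights, ‘train’ and ‘left’ would indeed score highly for ‘station’.

```python
# Minimal sketch of scaled dot-product self-attention (numpy).
# Embeddings and projection matrices are random stand-ins for learned weights.
import numpy as np

rng = np.random.default_rng(42)
tokens = ["the", "train", "left", "the", "station", "on", "time"]
d_model = 16

# Toy token embeddings: one row per token in the sequence.
X = rng.normal(size=(len(tokens), d_model))

# Query/key/value projections: learned in a real transformer, random here.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention scores: every token against every other token.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# New embeddings: weighted combinations of the value vectors.
new_embeddings = weights @ V

# Which tokens does 'station' attend to most?
station_idx = tokens.index("station")
ranking = sorted(zip(tokens, weights[station_idx]), key=lambda p: -p[1])
print(ranking)
```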
Two important things are happening here. Firstly, the embedding spaces these models learn are semantically continuous. This means that if we move slightly in the embedding space, the corresponding token changes only slightly – and so does the semantic meaning we humans assign to it. That’s a very useful property: moving continuously through the embedding space, which is made of numeric vectors, also means moving continuously through the world of semantics, which is the world we humans live in.
Secondly, the embedding spaces they learn are semantically interpolative. This means that if you take the intermediate point between any two points in the embedding space, that third point represents the intermediate meaning between the corresponding tokens. In other words, if you cut somewhere in between two points, you get the semantic intermediate point as well (I’d argue this second property is a consequence of the first, semantic continuity; you can’t have one without the other).
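A quick way to see interpolation in action (again assuming the pretrained Google News vectors from the earlier sketch, and word pairs I’ve picked purely for illustration) is to take the midpoint of two word vectors and ask which words sit closest to it:

```python
# Sketch: the midpoint of two word vectors lands near intermediate concepts.
# Assumes the same pretrained Google News Word2vec vectors as before.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

midpoint = (vectors["breakfast"] + vectors["dinner"]) / 2
print(vectors.similar_by_vector(midpoint, topn=5))
# Expect intermediate meal words (e.g. 'brunch', 'lunch') near the top,
# though the exact ranking depends on the vectors used.
```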
Why are these two properties so important? Because human brains work in a very similar fashion. Essentially, neurons that fire together often wire together, so when the brain learns something new, it creates maps of a space of information (this is known as Hebbian learning). Indeed, there are strategies people employ (such as mind maps) that enhance this innate ability of the brain to build maps of information, in order to learn faster, remember concepts better, or retrieve them after years and years. So the way the brain learns is remarkably similar to what’s happening on our GPUs when we train LLMs, which exhibit these same two properties of semantic continuity and semantic interpolation.
Of course, there have been huge improvements since Word2vec was introduced in 2013. LLMs aren’t just about finding a semantically similar word anymore; they have become far more powerful. For example, you can provide a paragraph, a document, or even a description of how your day went, ask in your prompt: ‘Write this in the style of Shakespeare’, and get, as output, a poem that resembles Shakespeare’s. In this way, LLMs are enhancing the concept of Word2vec and taking it to new dimensions.
This brings us to Chollet’s intriguing interpretation of LLMs as program databases.
You might be wondering: what does a database have to do with a large language model?
Well, consider that, just like a database, an LLM stores information, and this information can be retrieved via a query – which we now call a prompt.
However, the LLM goes beyond a basic database, because it doesn’t merely retrieve the same data you inserted: it also retrieves data that is interpolated and continuous. So the LLM can be seen as a continuous database, one that allows you to insert some data points and retrieve not only those points but also all the other data points that might lie in between.
That’s the generative power of LLMs as a database: the LLM doesn’t only give you what you inserted, but also something more, something that is an interpolation of what you have inserted.
These results don’t always make sense, because the LLM still has the potential to hallucinate. But even taking these glitches into account, Chollet has given us a very neat interpretation: the LLM is analogous to a database, with one major difference – the capability of returning not just the data points you inserted, but also an interpolation of those (and therefore points you never inserted at all).
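As a toy illustration of that difference (nothing to do with any real LLM’s internals – the keys, values and helper below are made up for the example), compare an exact-match lookup with a ‘continuous’ lookup that can answer queries it has never stored, by interpolating between nearby entries:

```python
# Toy contrast: exact-match database vs a 'continuous' one that interpolates.
# Keys are plain numbers standing in for embedding vectors.
import numpy as np

# A conventional key-value store: you only get back what you inserted.
exact_db = {0.0: "freezing", 10.0: "cold", 30.0: "hot"}
print(exact_db.get(20.0))  # None: 20.0 was never inserted.

# A 'continuous' store: queries between stored keys get an interpolated answer.
keys = np.array(sorted(exact_db))
values = [exact_db[k] for k in keys]

def continuous_lookup(query: float) -> str:
    # Blend the two nearest stored entries, weighted by distance.
    idx = np.searchsorted(keys, query)
    idx = int(np.clip(idx, 1, len(keys) - 1))
    lo, hi = keys[idx - 1], keys[idx]
    w = (query - lo) / (hi - lo)
    return f"{(1 - w):.0%} {values[idx - 1]} / {w:.0%} {values[idx]}"

print(continuous_lookup(20.0))  # "50% cold / 50% hot"
```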
The second major difference between LLMs and conventional databases is that LLMs don’t just contain data (and of course they contain billions of parameters now, even hundreds of billions with GPT-4); they also contain a database of programs.
So the search isn’t only in a data space; it’s also in a program space containing millions of programs. The programs are different ways of interpolating the data. In other words, the prompt essentially points to the most appropriate program to solve your problem.
For example, when you say, ‘rewrite this poem in the style of Shakespeare’, and provide the poem, the ‘rewrite this thing in the style of’ element acts as a program key pointing to a particular location in the program space of the LLM, while ‘Shakespeare’ and the poem you input are the program inputs. The output – the result of the LLM – is the result of the program execution.
In summary: you point to the program, you provide some arguments or inputs, the program executes, and you receive an output. The model operates much like a function in an imperative programming language: you pass the inputs and wait for the result of the computation.
We can also think of LLMs as machines that generate the next word given a context that acts as the statistical conditioning. This interpretation is easier to explain to non-technical people, even though it skips over many technicalities, of course.
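One way to picture this (a deliberately simplistic sketch – real LLMs don’t store explicit functions, and every name below is invented for the illustration) is a lookup table of ‘programs’ selected by the instruction part of the prompt and executed on the rest of it:

```python
# Deliberately simplistic sketch of 'prompt = program key + program inputs'.
# The functions and keys here are hypothetical; they only mirror the analogy.

def rewrite_in_style(style: str, text: str) -> str:
    return f"[{text!r} rewritten in the style of {style}]"

def summarise(text: str) -> str:
    return f"[summary of {text!r}]"

# The 'program database': instruction phrases act as keys into program space.
programs = {
    "rewrite this in the style of": rewrite_in_style,
    "summarise": summarise,
}

# The prompt selects a program and supplies its inputs.
prompt_key = "rewrite this in the style of"
program = programs[prompt_key]
print(program("Shakespeare", "my day at the office"))
```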
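If you want to see that framing directly, the Hugging Face transformers library makes the ‘next-word machine’ view tangible (assuming you’re willing to download a small model such as GPT-2; the continuation will differ from run to run):

```python
# Sketch: an LLM as a next-word machine conditioned on a context.
# Uses Hugging Face transformers; downloads the small GPT-2 model on first run.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

context = "The train left the station on time, and the passengers"
print(generator(context, max_new_tokens=20)[0]["generated_text"])
```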
There is one more thing that Chollet and I agree on: even though LLMs can seem like sentient machines that understand your prompt, this, of course, is not what’s happening. LLMs do not understand what you’re typing. The prompt is a query, a way to search a program space, just like a query you type into Google Search or submit to a database. It’s in fact a key, just a complex one, that allows the database to search for and retrieve the information you’re seeking, or interpolate it for you.
That’s the advice that Chollet is giving to all of us: resist the temptation to anthropomorphise neural networks and in particular, LLMs. Instead, consider them a sophisticated kind of database.
Happy prompting!