Francesco Gadaleta, PhD, is a seasoned professional in the fields of technology, AI and data science. He’s the founder of Amethix Technologies, a firm specialising in advanced data and robotics solutions. He hosts the popular Data Science at Home podcast, and over his illustrious career he’s held key roles in the healthcare, energy, and finance domains.
In this post, Francesco takes us on a deep dive into the realm of natural and artificial languages. He explores the potential applications of LLMs when used in conjunction with artificial language, considering the ways LLMs are progressively handling languages more effectively. LLMs, Francesco argues, could transcend the limitations of context-free grammars (CFGs):
The realm of language, encompassing both natural and artificial forms, presents a fascinating paradox. Natural languages like English or Mandarin boast remarkable flexibility and adaptability, yet this very quality can lead to ambiguity and misunderstandings. In contrast, artificial languages, designed with specific purposes in mind, prioritise clarity and structure, often at the expense of nuance and expressiveness. This article delves into the potential of artificial intelligence (AI) and large language models (LLMs) to bridge this gap, harnessing the strengths of both artificial and natural languages.
We begin by exploring the fundamental differences between these two language categories: natural and artificial languages.
NATURAL LANGUAGES, products of organic evolution through human interaction, are complex and ever-changing. Languages like English and Spanish developed over centuries with no central designer, and that organic growth produced intricate, sometimes ambiguous structures capable of serving remarkably diverse communication needs. Imagine the difference between a language sculpted with a specific purpose in mind, like the clarity and precision of a computer programming language, and the rich tapestry of a natural language woven from the threads of human experience. That’s the essence of the distinction between natural and artificial languages.
ARTIFICIAL LANGUAGES, by contrast, are deliberately created with specific goals in mind. Whether it’s the deterministic nature of first-order logic¹, the efficiency of Python or Rust, or the global aspirations of Esperanto, these languages prioritise clarity and structure to avoid ambiguity. Think of them as tools designed for specific tasks, in contrast to the naturally flowing and multifaceted nature of the languages that people speak every day.
Examples of artificial languages include the languages used in computer simulations between artificial agents, in robot interactions, in the messages that programs exchange according to a network protocol, and even in controlled psychological experiments with humans.
The presence of ambiguity in natural languages, and its near absence in most artificial languages, stems from several key differences in their origins and purposes. Natural languages have developed over centuries, leading to flexibility and adaptability, but also to inconsistent and sometimes imprecise rules.
They have served diverse communication needs, including conveying emotions, establishing social rapport and expressing creativity. Such multifunctionality leads to subtleties and shades of meaning that are not always explicitly defined. Moreover, natural languages heavily depend on the surrounding context, including the speaker’s intent, cultural background, shared knowledge, etc. This reliance on context can lead to ambiguity too, as the same words can have different interpretations in different situations, cultures, tribes or geographically separated locations.
In contrast, artificial languages are designed with specific goals in mind, like providing clear instructions or facilitating precise communication. This focus on clarity and efficiency often leads to strict rules and unambiguous structures. Artificial languages usually have a specific and well-defined domain of application, which allows them to focus on a smaller set of concepts and relationships, reducing the potential for ambiguity. The controlled vocabulary that typically characterises artificial languages eliminates the confusion that can arise from the synonyms, homophones and other features of natural languages. Homophones, words with the same sound but different meanings (e.g., ‘bat’ the flying mammal vs ‘bat’ the baseball bat), exist in natural languages but not in artificial languages designed to avoid such confusion.

Natural languages also rely on implicit information conveyed through context, tone of voice or facial expressions. This is yet another source of ambiguity, as the intended meaning may not be explicitly stated. Artificial languages, in contrast, strive to be explicit and avoid reliance on implicit information. Finally, natural languages constantly evolve over time, so words or phrases may carry interpretations that differ from what they meant in the past, whereas artificial languages are typically designed to be stable and resist change.
It’s important to note that not all artificial languages are completely unambiguous. Some, like logic formalisms, have well-defined rules that prevent ambiguity. Others, such as programming languages designed with a natural-language-like syntax, can still exhibit some ambiguity precisely because they try to mimic natural language constructs. Overall, the inherent flexibility and context-dependence of natural languages, in contrast to the focused and controlled nature of most artificial languages, are the primary factors behind the presence of ambiguity in the former and its absence in the latter.
The recent surge in large language models (LLMs) has unfortunately led to a widespread misconception that these models can truly understand and learn about the world through language. This belief, however, directly contradicts the inherent limitations of natural languages themselves when used as a sole source of knowledge about the world.
While current LLMs can generate seemingly humanlike responses based on context, they are far from truly understanding the world they navigate. Their ‘common sense reasoning’ relies on statistical patterns, not actual comprehension.
This highlights a fundamental limitation: natural language itself is inherently insufficient for comprehensive understanding. Unlike humans and other intelligent beings, who draw upon diverse inputs (including non-linguistic and non-textual information), LLMs are confined to the realm of language. It’s akin to assuming one can acquire complete knowledge of the world solely by reading an entire encyclopedia: language simply cannot capture the full spectrum of knowledge. It serves as just one, limited form of knowledge representation.
From a technical standpoint, both natural and artificial languages represent information using specific schemas. These schemas describe objects, their properties, and the relationships between them, with varying levels of abstraction. Consider the difference between reading a poem and playing or interpreting music. The ‘interpretation’ required for music adds a significant gap between the written and performed versions, conveying information beyond the literal text. This gap is less pronounced with artificial languages, particularly programming languages, as we will explore later. Language inherently transmits information with limited capacity. This is why humans rely heavily on context in communication. Often, this context is implicit or already understood, reducing the need for explicit communication.
In formal language theory, languages are classified into a hierarchy according to the computational machinery needed to recognise them:
1. REGULAR LANGUAGES: These are the simplest type of formal languages and can be recognised by finite automata. They are less expressive than context-free languages.
2. CONTEXT-FREE LANGUAGES: These languages can be recognised by pushdown automata and are more expressive than regular languages. Context-free grammars are widely used for describing the syntax of programming languages.
3. CONTEXT-SENSITIVE LANGUAGES: These languages can be recognised by linear-bounded automata, and they are more expressive than context-free languages. Context-sensitive grammars allow for more intricate and nuanced language structures.
The hierarchy reflects the increasing computational power needed to recognise or generate languages within each category. Context-sensitive languages are indeed more powerful than context-free languages, and context-free languages are more powerful than regular languages.
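To make this concrete, here is a minimal Python sketch (illustrative only; the function names are my own). A finite-state check is enough for a regular language, while a context-free language such as aⁿbⁿ needs an unbounded counter, the kind of memory a pushdown automaton adds:

# A regular language can be decided with a fixed number of states;
# a context-free language like {a^n b^n} needs unbounded memory (a counter/stack).

def accepts_regular(s: str) -> bool:
    """DFA for: strings over {a, b} that start with 'a' and end with 'b'."""
    # States: q0 = start, q1 = last char was 'a', q2 = last char was 'b'.
    transitions = {
        ("q0", "a"): "q1", ("q0", "b"): "dead",
        ("q1", "a"): "q1", ("q1", "b"): "q2",
        ("q2", "a"): "q1", ("q2", "b"): "q2",
    }
    state = "q0"
    for ch in s:
        state = transitions.get((state, ch), "dead")
        if state == "dead":
            return False
    return state == "q2"

def accepts_anbn(s: str) -> bool:
    """Counter-based check for {a^n b^n : n >= 0}: impossible with finitely many states."""
    count, seen_b = 0, False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False      # an 'a' after a 'b' is not allowed
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:
                return False      # more b's than a's so far
        else:
            return False
    return count == 0

print(accepts_regular("aab"))   # True
print(accepts_anbn("aaabbb"))   # True
print(accepts_anbn("aabbb"))    # False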
Practically no programming language is truly context-free. In practice, however, this limitation matters little: every programming language must be parsable, or it would not be useful at all, so any deviations from context-freeness are dealt with by the language’s grammar and the checks built around it. The grammar of a language is the set of rules that can produce the strings belonging to that language. Strings that violate the grammar cannot be generated by it (by definition) and therefore do not belong to the language (again, by definition).
An example of a grammar representing one particular construct of a programming language with a Rust-like syntax is provided below. Generally speaking, a grammar is formed by symbols that can be terminal or non-terminal. Terminal symbols are the actual tokens that appear in the strings of the language and cannot be rewritten any further; non-terminal symbols are placeholders (such as ‘variable’ or ‘type’) that the grammar’s rules keep replacing with other symbols until only terminals remain. For instance, to declare an integer variable in such a language, a programmer would type the following string:
let x: int = 42;
This statement contains only terminal symbols: the keywords let and int, the identifier x, the symbols :, = and ;, and the literal 42. The non-terminal symbols (such as S, V and T in the grammar below) never appear in the final string; they exist only inside the grammar.
Another important concept is that of production rules. Production rules are a formal description of how symbols in a language can be replaced or transformed into other symbols. Technically, what is on the left side of the -> produces the expression on the right:
Something -> SomethingElse
Producing means transforming the string on the left into the string on the right. Here’s an example of a context-free grammar and some productions that generate a simple variable declaration in a Rust-like language (similar constructs exist in other imperative languages such as C/C++, TypeScript, Go or Java):
GRAMMAR:
Non-terminals: S, T, V
Terminals: let, :, mut, identifier, int, float, bool
PRODUCTIONS:
1. S -> let V : T (start symbol: the let keyword, a variable, a colon and a type)
2. V -> identifier (a variable can be an identifier)
3. T -> int | float | bool (the type can be integer, float or boolean)
4. V -> mut identifier (optionally, a variable can be declared mutable)
EXAMPLE USAGE:
This grammar can generate simple variable declarations like:
• let x: int (integer variable)
• let mut y: bool (mutable boolean variable)
• let mut z: float (mutable floating-point variable)
LIMITATIONS:
This grammar is a very simple toy and only generates basic variable declarations. It cannot handle more complex features like functions, loops or control-flow statements, which would require additional productions and non-terminals. Real grammars contain several hundred production rules and symbols in order to express and generate all constructs of a programming language in a non-ambiguous way.
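As a sketch of how such productions are put to work, here is a naive recursive-descent recogniser for the toy grammar above, written in Python. The tokenisation and the helper names are invented for illustration; a real parser would be generated from a far larger grammar.

# Recursive-descent recogniser for the toy grammar:
#   S -> let V : T      V -> identifier | mut identifier      T -> int | float | bool
# Tokens are plain strings; anything that is not a keyword or ':' counts as an identifier.

TYPES = {"int", "float", "bool"}
KEYWORDS = {"let", "mut", ":"} | TYPES

def is_identifier(tok: str) -> bool:
    return tok not in KEYWORDS and tok.isidentifier()

def parse_declaration(tokens: list[str]) -> bool:
    pos = 0

    def expect(pred) -> bool:
        nonlocal pos
        if pos < len(tokens) and pred(tokens[pos]):
            pos += 1
            return True
        return False

    if not expect(lambda t: t == "let"):      # S -> let V : T
        return False
    expect(lambda t: t == "mut")              # V -> mut identifier (the 'mut' is optional)
    if not expect(is_identifier):             # V -> identifier
        return False
    if not expect(lambda t: t == ":"):
        return False
    if not expect(lambda t: t in TYPES):      # T -> int | float | bool
        return False
    return pos == len(tokens)                 # every token must be consumed

print(parse_declaration(["let", "x", ":", "int"]))          # True
print(parse_declaration(["let", "mut", "y", ":", "bool"]))  # True
print(parse_declaration(["let", "name", ":", "String"]))    # False: String is not in the grammar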
In the study of formal languages and automata theory, grammars are used to describe the structure of a language. Context-sensitive, context-free, and regular grammars are three types of grammars that differ in the way they generate strings.
A regular grammar generates a regular language, which is a language that can be recognised by a finite state machine. Regular grammars are the simplest type of grammar and are often used to describe simple patterns in language. For example, the grammar that generates the language of all words over the letters ‘a’ and ‘b’ that start with ‘a’ and end with ‘b’ is a regular grammar.
This grammar can be represented by the following production rules:
S -> aB
B -> aB | bB | b
CONTEXT-FREE GRAMMARS generate a context-free language, which is a language that can be recognised by a pushdown automaton. Context-free grammars are more powerful than regular grammars and can describe more complex patterns in language. For example, the grammar that generates the language of all matching pairs of parentheses is a context-free grammar.
This grammar can be represented as S -> (S)S | ε.
CONTEXT-SENSITIVE GRAMMARS generate a context-sensitive language, which is a language that can be recognised by a linear-bounded automaton. Context-sensitive grammars are even more powerful than context-free grammars and can describe patterns that context-free grammars cannot. A classic example is the language of all strings of the form aⁿbⁿcⁿ (an equal number of a’s, b’s and c’s, in that order), which cannot be generated by any context-free grammar. (Palindromes, sometimes cited here, are in fact context-free: the grammar S -> aSa | bSb | a | b | ε generates them.)
A context-sensitive grammar for aⁿbⁿcⁿ (n ≥ 1) can be written with the productions S -> aSBC | aBC, CB -> BC, aB -> ab, bB -> bb, bC -> bc, cC -> cc.
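For completeness, here is a small Python sketch (illustrative only; function names are my own) that checks membership in the three example languages. A regular expression is enough for the regular language, a counter (a degenerate pushdown stack) handles the parentheses, and a direct count stands in for the linear-bounded automaton of the context-sensitive case.

import re

def in_regular_language(s: str) -> bool:
    """Regular: words over {a, b} that start with 'a' and end with 'b'."""
    return re.fullmatch(r"a[ab]*b", s) is not None

def balanced_parentheses(s: str) -> bool:
    """Context-free: matching pairs of parentheses, S -> (S)S | ε, checked with a counter."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False      # a closing parenthesis with no match
        else:
            return False
    return depth == 0

def equal_abc(s: str) -> bool:
    """Context-sensitive: strings of the form a^n b^n c^n with n >= 1."""
    n = len(s) // 3
    return n >= 1 and s == "a" * n + "b" * n + "c" * n

print(in_regular_language("aab"))      # True
print(balanced_parentheses("(()())"))  # True
print(equal_abc("aabbcc"))             # True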
In human language terms, regular grammars can describe simple patterns like words that start or end with a certain letter. Context-free grammars can describe more complex patterns like matching pairs of parentheses or palindromes. Context-sensitive grammars can describe even more complex patterns, such as the equal-counts language above.
While all three grammar types can generate language, their limitations increase when dealing with the complexities of natural language. Context-free grammars offer a good balance between power and practicality for many applications, while context-sensitive grammars offer the most expressive power but come at a higher computational cost and complexity.
The grammars discussed are formal devices used to describe the structure of languages, including natural ones. While they can capture specific aspects of natural languages, they cannot fully replicate the complexity and nuance that arise from the organic evolution and diverse uses of natural languages. Indeed, particularly in spoken languages, grammar rules are often bent rather than strictly adhered to. Slang and other language variations not only reflect the culture of the speakers but also serve as highly effective means of communication.
Context-free grammars (CFGs) are a type of formal grammar used to describe the structure of languages. They offer a balance between expressive power and practicality in various applications, thanks to the properties listed below:
● Rules and Symbols: A CFG consists of a set of rules that rewrite symbols. These symbols can be terminals (representing actual words or punctuation) or non-terminals (representing categories of words or phrases).
● Hierarchical Structure: The rules define how non-terminals can be replaced by sequences of terminals and other non-terminals, capturing the hierarchical structure of languages (e.g., a sentence can be composed of a subject and a verb phrase, which can further break down into nouns and verbs).
● Context-Free Replacement: Importantly, the replacement of a non-terminal with another symbol can happen regardless of the surrounding context. This means the rule can be applied anywhere the non-terminal appears.
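A toy example makes the ‘context-free replacement’ idea tangible. In the Python sketch below (grammar and vocabulary invented for illustration), every non-terminal is expanded wherever it occurs, regardless of what surrounds it:

import random

# A tiny context-free grammar for English-like sentences.
# Each non-terminal maps to a list of possible right-hand sides.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N":  [["cat"], ["dog"], ["compiler"]],
    "V":  [["sees"], ["parses"]],
}

def expand(symbol: str) -> list[str]:
    """Replace a non-terminal by one of its productions, independently of context."""
    if symbol not in GRAMMAR:            # terminal symbol: emit as-is
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    words: list[str] = []
    for sym in production:
        words.extend(expand(sym))
    return words

print(" ".join(expand("S")))   # e.g. "the dog parses the compiler"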
CFGs are more powerful than regular grammars, allowing them to describe the structure of complex languages like programming languages and many aspects of natural languages (e.g., handling nested phrases and clauses in sentences). They provide a theoretical framework for building parsers, which are programs that analyse the structure of text according to a grammar. Parsers are crucial for tasks like compilers (understanding programming code) and natural language processing (understanding human language). CFGs can be used as a starting point for building machine translation systems, where the grammar helps identify the structure of sentences in different languages. While CFGs alone cannot capture all the complexities of natural language, they serve as a foundation for more advanced NLP techniques. Analysing and understanding sentence structure is a crucial step in many NLP tasks like sentiment analysis or text summarisation. Finally, they play a role in the theoretical study of language structure, helping linguists understand how human languages can be generated and parsed.
While context-free grammars (CFGs) provide a theoretical framework for understanding language structure, LLMs have a different and more complex relationship with grammars when it comes to generating and understanding text.
LLMs don’t explicitly use formal grammars. They don’t rely solely on predefined rules like CFGs to generate or understand text. LLMs are trained on massive amounts of text data, learning statistical patterns and relationships between words and phrases. LLMs use this knowledge to predict the next most likely word in a sequence, considering the context of previous words.
Through training, LLMs implicitly capture some aspects of grammars, like word order, sentence structure, and common phrasings. However, this capture is not based on explicit rules like CFGs. Instead, it’s based on the statistical patterns observed in the training data.
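A deliberately crude sketch of this idea is a bigram model, written below in Python. It is nothing like a modern transformer-based LLM, but it makes concrete what ‘predicting the next most likely word from observed statistics, without explicit grammar rules’ means. The toy corpus is invented.

from collections import Counter, defaultdict

# Toy bigram "language model": count which word follows which in a small corpus,
# then predict the most frequent follower. Real LLMs use neural networks over much
# longer contexts, but the principle (statistics rather than grammar rules) is the same.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

follow_counts: defaultdict[str, Counter] = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often observed right after `word` in the corpus."""
    if word not in follow_counts:
        return "<unknown>"
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("the"))   # 'cat' (observed twice after 'the')
print(predict_next("cat"))   # 'sat' (ties broken by first occurrence)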
One of the most tangible benefits of LLMs over CFGs is that LLMs can handle more complex structures. They can go beyond the limitations of CFGs, dealing with long-range dependencies and nuances of natural language that are difficult to capture with formal rules. For instance, while CFGs struggle with agreement between distant words, LLMs can learn these relationships from statistical patterns in real-world text usage.
Of course, LLMs don’t come without limitations. In particular, they are limited by the quality and diversity of the data they are trained on: biases and gaps in the data can be reflected in their outputs. LLMs can still generate grammatically incorrect or nonsensical outputs, especially in complex or unfamiliar contexts. Such outputs are usually referred to as hallucinations, and they are inevitable. Therefore, the relationship between LLMs and formal grammars is not a direct one: LLMs learn from data and capture statistical patterns, which indirectly relate to, and sometimes go beyond, the capabilities of formal grammars like CFGs.
Are LLMs better suited to artificial or to natural languages? While the answer is nuanced, LLMs generally demonstrate greater power in dealing with artificial languages than with natural languages, for several reasons:
● Well-defined structure: Artificial languages, like programming languages or logic formalisms, have clearly defined rules and structures that are explicitly designed and documented. This makes them more predictable and easier for LLMs to learn from a statistical perspective.
● Smaller vocabulary and simpler grammar: Artificial languages typically have a smaller vocabulary and simpler grammar compared to natural languages. This reduces the complexity involved in understanding and generating valid sequences in the language, making it easier for LLMs to achieve higher accuracy.
● Limited ambiguity: Artificial languages are often designed to be less ambiguous than natural languages. This means there are fewer potential interpretations for a given sequence of symbols, making it easier for LLMs to identify the intended meaning.
However, this doesn’t imply a complete inability to handle natural languages. LLMs have shown remarkable progress in dealing with natural languages, mainly due to:
● Massive training data: They are trained on vast amounts of real-world text, allowing them to capture complex statistical patterns and nuances that may not be explicitly defined by grammar rules.
● Adaptability: They can adapt to different contexts and styles based on the data they are exposed to, providing a more flexible approach compared to the rigid rules of artificial languages.
Therefore, while LLMs generally perform better with artificial languages due to their well-defined nature, their ability to handle natural languages is constantly improving due to advancements in training data and model architecture, making them increasingly effective in dealing with the complexities of human language. It’s crucial to remember that LLMs are not perfect in either domain. They can still struggle with complex tasks, generate nonsensical outputs, and perpetuate biases present in their training data.
An aspect that remains to be fully explored and can play a fundamental role in understanding and generating languages, especially artificial ones, is the realm of compilers.
Compilers are programs that translate code written in a high-level programming language (source code) into a lower-level language (machine code) that a computer can understand and execute.
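To make this concrete, here is a toy ‘compiler’ sketched in Python (an illustration, not a real compiler): it parses arithmetic expressions over single digits with + and *, and emits instructions for an invented stack machine, mirroring in miniature the source-to-target translation that a real compiler performs.

# Toy "compiler": translate expressions over single digits, '+' and '*'
# into instructions for an invented stack machine (PUSH, ADD, MUL).
# Grammar: expr -> term ('+' term)*    term -> factor ('*' factor)*    factor -> digit

def compile_expr(src: str) -> list[str]:
    pos = 0

    def peek() -> str:
        return src[pos] if pos < len(src) else ""

    def factor() -> list[str]:
        nonlocal pos
        ch = peek()
        assert ch.isdigit(), f"expected a digit at position {pos}"
        pos += 1
        return [f"PUSH {ch}"]

    def term() -> list[str]:
        nonlocal pos
        code = factor()
        while peek() == "*":
            pos += 1
            code += factor() + ["MUL"]
        return code

    def expr() -> list[str]:
        nonlocal pos
        code = term()
        while peek() == "+":
            pos += 1
            code += term() + ["ADD"]
        return code

    return expr()

print(compile_expr("1+2*3"))
# ['PUSH 1', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']  (multiplication binds tighter than addition)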
Compilers and LLMs can form a potent combination for several compelling reasons:
1. Enhanced Code Generation: LLMs can pre-generate code skeletons or suggest completions based on the context of existing code, as is already the case. Moreover, taking the grammar and production rules of the target language into account can further elevate the quality of the generated code.
2. Improved Error Detection: LLMs, trained on extensive datasets of code containing identified errors, can effectively pinpoint potential issues in code that traditional compilers might overlook. This capability extends to detecting inconsistencies, potential security vulnerabilities, or inefficiencies in the code structure.
3. Natural Language Programming: Integrating LLMs with compilers can enable the use of more natural language-like instructions for code generation. This approach holds promise for making programming more accessible to individuals who are not proficient in conventional programming languages.
While these ideas serve to illustrate the possibilities, the technology is already sufficiently mature to realise these objectives, except in the realm of natural language. This suggests that we are probably employing LLMs to address a problem for which they were not originally intended.
At Intrepid AI (intrepid.ai), we envision a future where AI systems not only excel in comprehending and generating artificial languages but also demonstrate remarkable proficiency in navigating the complexities of natural languages.
We invite researchers, developers, and enthusiasts alike to join us² on our journey towards unlocking the full potential of AI-assisted language understanding and generation. Together, we can harness the power of LLMs and compilers to create intelligent robotics systems that redefine the way we interact with and interpret language in the digital age.