Bartmoss St Clair holds maths and physics degrees from Heidelberg University. He’s an AI researcher who previously worked at Harman International and Samsung, and has been a guest researcher at the Alexander von Humboldt Institute in Berlin. Currently, Bartmoss is Head of AI at LanguageTool. He’s also the creator of the open source community Secret Sauce AI.
In this interview, Bartmoss explains how his open source LLM has transformed LanguageTool into a sophisticated grammatical error correction application. LanguageTool does more than detect grammatical errors – it tailors the level of correction to the user and understands the nuances of modern language:
I was a physicist at university, but I found academic life just didn’t suit me. Through a connection with a colleague, I got involved in a very interesting project with natural language processing, dealing with automating content governance systems for banks. Then I founded a company to build up solutions for five or six different languages, and that’s when I discovered my passion.
So at LanguageTool, the one use case for us is grammatical error correction. Someone writes a sentence or some text, they want their grammar checked, and the tool replaces it with the correct grammar. Of course, that’s a very basic use case. LanguageTool has existed for about 20 years, but back then we didn’t use AI or machine learning.
As head of AI, one of the things I wanted to do was create a general grammatical error correction (GEC) system to catch all kinds of errors in as many languages as possible, and correct them. So initially, we needed a model that would correct the text someone has written. We had to decide whether to take an existing LLM like GPT-3 or GPT-4 and just use prompting, or to create our own model. And we found that creating our own model was cheaper, faster, and more accurate.
There was a lot to consider when creating a model that will run for millions of users: we had to make sure there was a good trade-off between performance, speed, and price. Decoder models have become extremely popular, and we’ve seen a lot of scaling behind them. But sometimes the question is: how big does the model need to be for your purpose? And you have to benchmark and test to see how well it works.
This is a real-time application that needs to work with low latency as users are typing in a document, or on a website, or anything in their browser. So there are actually plenty of ways you can use LanguageTool; while it needs to work as quickly as possible, the use cases are completely varied. It can be academic writing. It could be for business (we have both B2B and B2C customers). So we don’t have a one-size-fits-all solution for our customers. But we always have to find a good balance between accurate grammar correction and not annoying people.
You might assume that grammar’s either correct or not, but the reality is more nuanced. There are cases where users don’t like a rule even if it’s technically correct. In these cases, you have to debate whether you want to turn it off. Take, for example, the phrase ‘to whom’ or ‘for whom.’ Nowadays, people just say ‘who.’ Grammar changes over time, and we have to be mindful of that, and strive to strike a balance with our users.
We have several million users in many languages; we have over 4 million installations of the browser add-on, in fact. We have six primary languages, but we support over 30 different languages worldwide. And we also have employees from all over the world.
If you’re starting out by prototyping, maybe you want to use a very large decoder model with a prompt so that it shows some sort of emergent behaviour. But if you would like to do a task like translation, sequence-to-sequence encoder-decoder models generally perform really well. You also need to consider issues like the context window: do you want to work on large amounts of text, or do you just want to work at the sentence level? There are plenty of things to consider.
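To make the encoder-decoder option concrete, here is a minimal sketch of running a sequence-to-sequence model for sentence-level correction with the Hugging Face transformers library. The "t5-small" checkpoint and the "grammar:" task prefix are illustrative assumptions, not LanguageTool's actual setup.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "t5-small" is a stand-in; in practice you would load a checkpoint
# fine-tuned on grammatical error correction pairs.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def correct(sentence: str) -> str:
    # The "grammar:" task prefix is an assumption made for this example,
    # not a LanguageTool convention.
    inputs = tokenizer("grammar: " + sentence, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(correct("She go to school every days."))
```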
We first aim to fine-tune the model with really good data, where you have sentence pairs or text pairs: one version that could have mistakes in it and one that’s completely golden. Next, we check the model to see how it’s working. This is a complex process: there can be multiple valid corrections for a sentence, and the model must be able to handle that. It’s a fascinating challenge because it’s not black and white.
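As an illustration of what such sentence-pair supervision can look like, here is a small, hypothetical preprocessing sketch for a seq2seq model. The example pairs, the task prefix, and the tokenizer choice are assumptions made for the sake of the example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder tokenizer

# Toy sentence pairs: a possibly erroneous source and a "golden" target.
pairs = [
    {"source": "He have two cat.",     "target": "He has two cats."},
    {"source": "I am agree with you.", "target": "I agree with you."},
]

def encode(pair):
    # Tokenise the (possibly wrong) input and attach the golden correction
    # as the label sequence, as a seq2seq fine-tuning loop would expect.
    model_inputs = tokenizer("grammar: " + pair["source"],
                             truncation=True, max_length=128)
    model_inputs["labels"] = tokenizer(pair["target"],
                                       truncation=True, max_length=128)["input_ids"]
    return model_inputs

encoded = [encode(p) for p in pairs]  # would then be fed to a trainer
```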
Other challenges include issues with hallucinations or extreme edits. You want the output to retain the same meaning as the original sentence, and maybe you don’t want certain words changed; you just want the grammar corrected.
We like to offer users a multitude of different ways of correcting. Some users want to apply a certain style; some just want their grammar checked, and some want both. So, you have to be certain you’re changing language at the right level, in the right way, to meet the user’s specifications.
There are countless different tools to solve these types of problems: everything from checking edit distance with Levenshtein distance to checking similarity with cosine similarity. You can also classify and tag sentence edits. This can reduce the issue of over-editing, or over-changing a sentence.
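Here is a hedged sketch of the two simple over-editing guards mentioned above: character-level Levenshtein distance and embedding cosine similarity. The thresholds and the sentence-embedding model are illustrative assumptions, not LanguageTool's production values.

```python
import Levenshtein                      # pip install python-Levenshtein
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def is_over_edit(original: str, corrected: str,
                 max_edit_ratio: float = 0.4,
                 min_similarity: float = 0.85) -> bool:
    # Flag corrections that change too many characters relative to length...
    edit_ratio = Levenshtein.distance(original, corrected) / max(len(original), 1)
    # ...or that drift too far from the original meaning.
    similarity = util.cos_sim(embedder.encode(original),
                              embedder.encode(corrected)).item()
    return edit_ratio > max_edit_ratio or similarity < min_similarity

print(is_over_edit("She go to school.", "She goes to school."))    # small fix, usually kept
print(is_over_edit("She go to school.", "Attending class is fun."))  # rewrite, flagged
```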
Occasionally, there’s also the question of how much you can change a sentence before you shouldn’t change it anymore. Because LanguageTool has existed for 20 years, there are many standards and practices that developed while building the rule-based system, and we’ve inherited these in moving into AI.
Absolutely. When it comes to basic things that can be formulated into rules, it’s very cheap, very accurate, and it just works. There’s no reason to change that.
Machine learning is needed for the more complex, context-based grammar issues, where you can’t just create a simple rule because there are so many exceptions. And running machine learning models with inference in production can be pricier than just writing a simple rule with RegEx or Python.
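As a toy example of the kind of cheap rule-based check referred to here, the snippet below uses a regex to flag a doubled word. The rule itself is invented for illustration and is not one of LanguageTool's actual rules.

```python
import re

# A backreference (\1) catches an immediately repeated word, e.g. "the the".
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_doubled_words(text: str) -> list[str]:
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]

print(find_doubled_words("This is is a test of the the rule."))  # ['is', 'the']
```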
One essential aspect is that we don’t just correct the grammar, but we explain to the user why; we give the reason for the correction. And every time we have a match, we also have to explain the reasoning behind that.
And this means that every match has a unique rule ID. So you have IDs for the rule-based matches, you have IDs for all the types of matches that can occur on the machine learning side, and you can then prioritise those rules.
For example, let’s say a rule in the rule-based system works nine times out of ten, but it also triggers that one time when there’s a deeper context. In that case, you can prioritise the AI model over the rule-based system. This approach works really well. It ensures we don’t get stuck in endless correction loops.
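A simplified sketch of prioritising overlapping matches by rule ID follows. The dataclass fields, rule IDs, and priority table are invented for illustration; LanguageTool's real matching pipeline is more involved.

```python
from dataclasses import dataclass

@dataclass
class Match:
    rule_id: str      # e.g. a legacy rule ID or an ML match-type ID
    start: int
    end: int
    replacement: str

# Higher number wins when two matches cover the same span (made-up values).
PRIORITY = {"AI_GEC_AGREEMENT": 10, "RULE_SUBJECT_VERB": 5}

def resolve(matches: list[Match]) -> list[Match]:
    kept: list[Match] = []
    for m in sorted(matches, key=lambda m: PRIORITY.get(m.rule_id, 0), reverse=True):
        # Keep a match only if it does not overlap one with higher priority.
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return kept
```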
We do keep our legacy rules where they function very well, and they always work. There are certain cases where something’s always going to be true, like capitalising at the beginning of a sentence or punctuation at the end, and so on.
But general grammatical error correction is something we worked on for a while, and we actually do have it running in production. The system has many, many layers in production, from rule-based systems to specific machine learning-based solutions for a single type of correction. And we also use the GEC, the general grammatical error correction model, which can solve a lot.
In the end, it’s a mixture. We have opt-in user data that we collect from our website, and also golden data that we’ve annotated internally ourselves from our language experts. We have experts for every language, who go through and review and annotate data for us, which is extremely helpful. Furthermore, we also generate synthetic data.
Generally, when it comes to these models, you have to ask yourself: do you want to train these things in stages, or do you want to train them all in one go? Is it better to use synthetic data for part of it, or should you just use fewer data points and focus on quality?
There’s an interesting mixture of tasks and methods that we use with generation and user data, along with data that we internally create and annotate. One important thing we have to watch out for is the distribution of the data: it must reflect what users actually write. We always aim to keep that distribution as close to real usage as possible. That way, you can pick up common user errors, such as misplaced commas, which occur more often than certain other types of errors.
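A toy sketch of the distribution point: synthetic training errors can be sampled so their mix mirrors an observed user-error distribution. The categories and weights below are made-up examples, not LanguageTool's real statistics.

```python
import random

# Relative frequency of error types, as might be observed in opt-in user data.
ERROR_DISTRIBUTION = {
    "comma": 0.40,
    "agreement": 0.25,
    "article": 0.20,
    "spelling": 0.15,
}

def sample_error_type() -> str:
    # Draw an error type with probability proportional to its observed share.
    types, weights = zip(*ERROR_DISTRIBUTION.items())
    return random.choices(types, weights=weights, k=1)[0]

# Each synthetic example would be corrupted with an error of the sampled type.
print([sample_error_type() for _ in range(5)])
```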
We’ve had to build tools from scratch to address that issue because they just didn’t exist before. As I mentioned earlier, you can get something marked wrong that’s technically correct. There’s more than one way to correct a sentence. Of course, the most obvious solution is to collect every possible variant of a correct version of a sentence. But that’s a monumental task, and it’s not always the best way to do it.
You need to ask yourself: what’s more important, precision or recall? When it comes to the users, we do have some fantastic analytics. We can see if users generally applied our suggestion, or if they said no to it, or just ignored it.
And we guide ourselves a lot by that. There’s no silver bullet. Instead, it’s a barrage of methodologies to ensure that we are giving our users the best possible quality corrections that they can trust, whether they’re writing a thesis or a legal document. And because we’ve been doing this for 20 years, we’ve ironed out a lot of those issues.
There are countless different measures we use for performance. There are the typical F1 scores: when you’re training or fine-tuning a model with your evaluation dataset, this is something you look at. That’s not 100% foolproof, but it offers a good rule of thumb. There are also the user analytics: when you put something online, you can see very quickly if the users really hate it. So we do partial rollouts and A/B tests.
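As a quick reminder of how the precision/recall trade-off mentioned earlier feeds into F1, here is a minimal worked example; the counts are invented for illustration.

```python
# tp: suggestions that were correct; fp: suggestions that were wrong;
# fn: real errors the system missed (all values are made up).
tp, fp, fn = 80, 10, 30

precision = tp / (tp + fp)                       # how many suggestions were right
recall = tp / (tp + fn)                          # how many real errors were caught
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```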
We also do manual reviews: taking subsets of data and having professional language experts review them. These reviews are time-consuming, but extremely important. Our language experts have a very deep understanding of language, so they will catch any errors.
There are two main use cases for us with LLMs. One is general grammatical error correction, and the other is rewriting or rephrasing; for example, you can’t do paraphrasing without these kinds of models. That was a huge use case for us, something you couldn’t do before with previous systems.
And also with grammatical error correction, you can catch things that you couldn’t catch before LLMs. Back then, the best we could do was train much smaller models on a very specific error – where there’s a comma missing, for example. You would have to literally think of each and every type of error possible. Whereas when you’re training a model, it can catch a lot of things that are much deeper than you would ever think about; very deep contextual things. You can also change things with style, improve fluency. There are so many use cases for language that we can bring into all kinds of applications.
I use the tool myself, and even though I taught English for several years, I’ll discover that I missed a comma here and there. And that’s quite helpful!
Absolutely. There was a cost trade-off there because we host all of our own infrastructure. With our GPU servers, it’s a no-brainer. You get higher retention when you find more errors and improve people’s spelling and grammar, and consequently we get better conversion rates for our premium product. So, it’s a worthy investment for us.
Of course, working to bring the cost down is something we actively do. We’ve worked a lot on compressing and accelerating models. These things aren’t plug and play; you’ve got to build the framework, fix bugs, and so on.
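One common compression technique of the kind referred to here is post-training dynamic quantisation; the sketch below shows the general idea in PyTorch and is not a description of LanguageTool's actual inference stack (the checkpoint is a placeholder).

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder checkpoint

# Replace Linear layers with int8 dynamically quantised versions, trading a
# small amount of accuracy for lower memory use and faster CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```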
I think to some degree they already do. We can, a lot of times, detect whether a longer text is generated. And I’ve noticed by playing around with different LLMs that they like to inject certain words all the time. And I don’t know if that’s for watermarking purposes or what that actually is because a lot of times they’re not very transparent about these things.
I think as we rely on LLMs more, we will definitely see a greater impact on language. But it’s always important to note that these things are just predicting the next token based on a huge amount of data from people. And so, as long as you’re training new models or fine-tuning new models, that’ll change with the times as well.
Language is very fluid, and it changes over time, and we have to change with it. And I think there’s nothing wrong with that.
We don’t use external APIs, especially for grammatical error correction. That would be a privacy concern because a lot of our users are very privacy-focused, and our data generally stays within the EU. And we don’t save data from our users unless they’ve opted in from the website.
And so, we keep all of that processing in-house. Everything we’re doing is built completely internally. And we use a variety of different types of open-source models, and we’re always experimenting with new ones.