Rex VanHorn is Senior Enterprise IT Architect at Boehringer Ingelheim Pharmaceuticals. He's worked at Boehringer for over 25 years, specialising in automation through API interfacing and advanced applications. Rex earned an MBA in Finance from The Ohio State University and an MSc in Artificial Intelligence. He's currently pursuing a PhD in AI at UGA, focusing on advanced NLP techniques.
In this post, Rex outlines his organisation’s innovative approach to interpreting its quality data. He introduces Boehringer’s new semantic applications: the Trending Tool and Recurrence Check Tool. Rex explains how these applications discover, manage, and evaluate important trends. Over time, he explains, they could be used to automate the process entirely:
In 2022, our organisation was challenged to find an innovative way to identify existing trends in our Quality data. After running through the usual statistical suspects – think regression analysis and business intelligence dashboards – we were encouraged to think bigger. The system should be able to group similar, existing records together regardless of who entered them, where they were entered, and the language in which they were entered. Just as importantly, the system must be able to evaluate a new record and predict the type of trend it might fall into, as well as offer a potential path towards investigation and resolution. And of course, the application also needed to comply with the regulatory rules for applications assisting with low-risk processes in the pharmaceutical manufacturing space (i.e., GxP applications).
The Trending Tool (TT) is backwards-facing: it looks over the sea of existing records and clusters them into groups by the type of record they represent. These collections of records are the trends.
There are four main processes that underpin the Trending Tool:
1. Translation. We focus on the unstructured text in the records, such as the short and detailed descriptions. Because similar records can be entered in any language, every record is first translated into a common language (i.e., English).
2. Embedding. We then transform the unstructured text into structured data by creating embeddings. With embeddings, we can identify the significant words and ideas that carry the most weight and relevance within the given context.
3. Clustering. Next comes the crucial step of clustering: clustering algorithms analyse the embeddings and group similar pieces of text into clusters. The goal is to find every major trend hiding in the data by clustering similar records together.
4. Labelling. In the final stage, we label the clusters/trends by automatically assigning a meaningful phrase to each cluster that briefly describes how the records in the cluster are related. These labels attempt to provide a summary of the content within each cluster, making it easier for users to navigate and understand large volumes of unstructured data.

With the records now assigned to their trend/cluster, they can be further segmented, and their occurrence can be graphed over time, revealing important information about our efforts to manage those trends. Our (human) employees then evaluate each trend, looking for new, emerging trends and checking whether known trends are being adequately mitigated. For example, if we find more deviations, month over month, then that strongly suggests that our mitigation efforts need to be reviewed. If, in contrast, we find that the number of deviations drops month over month, then that strongly supports the conclusion that our mitigation efforts are succeeding. The TT therefore provides a broad yet focused view into our quality processes, and does so regardless of the records' input language or location.
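As a rough illustration of those four steps (translation aside), the sketch below clusters a handful of hypothetical, already-translated record texts with the TF-IDF vectorizer and HDBSCAN algorithm that, as described later in this post, ultimately powered the Trending Tool. The sample texts, preprocessing choices and HDBSCAN parameters are illustrative assumptions, and labelling is reduced here to picking each cluster's highest-weighted TF-IDF terms rather than the tool's actual labelling logic.

```python
import numpy as np
import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical record texts, already translated into English (step 1).
records = [
    "Temperature excursion in cold storage during transfer",
    "Cold room temperature exceeded limit overnight",
    "Label misprint detected on packaging line",
    "Incorrect label applied to secondary packaging",
    "Operator missed a step in the cleaning procedure",
]

# Step 2: embed the unstructured text as sparse TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words="english")
embeddings = vectorizer.fit_transform(records)

# Step 3: cluster similar records; HDBSCAN labels outliers as -1 (noise).
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric="euclidean")
labels = clusterer.fit_predict(embeddings.toarray())

# Step 4: label each cluster with its highest-weighted terms as a crude summary.
terms = np.array(vectorizer.get_feature_names_out())
for cluster_id in sorted(set(labels) - {-1}):
    centroid = embeddings[labels == cluster_id].toarray().mean(axis=0)
    top_terms = terms[centroid.argsort()[::-1][:3]]
    print(f"Trend {cluster_id}: {', '.join(top_terms)}")
```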
The Trending Tool's sister application is the Recurrence Check Tool (RCT). The TT looks back over time, while the RCT looks towards the future. The purpose of the RCT is to take a brand-new (and unclustered) record, read its unstructured text fields, find existing records similar to the new record, and predict to which trend the new record might belong. The RCT is an AI-enabled extension of the previous, manual recurrence check process, in which individuals query the GOTrack database by hand, looking for existing records whose field data match the new deviation record. The RCT still performs this task, but does so at the push of a button, eliminating the need for any manual querying.
In addition to the automated database query, the RCT employs two new semantic search operations, looking first at individual record similarity and then at cluster similarity. The first operation, individual similarity, semantically compares the current record's text data to every other record's text data, independent of language or origination location, and returns the list of existing records that are most closely semantically related to the new record. The second operation again uses semantic similarity to find the record that is most semantically similar, and then determines whether that record belongs to a cluster. If so, the RCT uses additional information to compare the new and existing clustered records for potential similarity. The application of individual and cluster similarity provides an AI-enabled 'four eyes' principle for finding semantically similar records, while using these three methods together gives us an automated recurrence check that stays faithful to the original search, yet extends it beyond keyword/field matching to include semantically related records.

Records whose fields/values match through querying are presented to the user as a SQL-based query match. In contrast, records whose fields/values do not match the current record, but whose text fields are nonetheless semantically similar in meaning, independent of language, receive a similarity score between 0.0 and 1.0. The similarity score is the calculated cosine similarity between the new and existing records' vector embeddings. A similarity score of 1.0 means that the records are completely similar, a score of 0.0 means the records are completely dissimilar, and a similarity score of 0.5 indicates that the records are partially related. Finally, those existing records that are part of a cluster, and are therefore associated with an existing quality trend, are presented to the system user with the trend/cluster number next to the similarity score. Users can then use this information to complete the regulatorily required recurrence check.
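To make the semantic half of this check concrete, here is a minimal sketch of the individual-similarity comparison, using a publicly available MS MARCO checkpoint from the sentence-transformers library as a stand-in for our internally fine-tuned model. The model name and the example record texts are illustrative assumptions, not details of the production system.

```python
from sentence_transformers import SentenceTransformer, util

# Public MS MARCO variant used here as a stand-in for the fine-tuned internal model.
model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v4")

# Hypothetical existing records (already translated into English) and a new record.
existing_records = [
    "Temperature excursion observed in cold storage unit during transfer",
    "Label misprint detected on packaging line during batch release",
    "Operator deviation from cleaning procedure in filling suite",
]
new_record = "Cold room temperature exceeded limit while product was being moved"

# Embed the texts as dense vectors and compute cosine similarity.
new_emb = model.encode(new_record, convert_to_tensor=True)
existing_embs = model.encode(existing_records, convert_to_tensor=True)
scores = util.cos_sim(new_emb, existing_embs)[0]

# Present the existing records ranked by semantic similarity, highest first.
for score, text in sorted(zip(scores.tolist(), existing_records), reverse=True):
    print(f"{score:.2f}  {text}")
```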
Our long-term vision for the TT and RCT is to create an autonomously intelligent framework for trend and recurrence management, whereby the system alone can find trends using a myriad of information and intelligence perspectives, and then manage those trends without human interaction. For many obvious reasons, we did not pursue that vision in the initial release of the applications. Instead, releasing the TT and RCT into production meant introducing them with the requirement that a human review the information provided by the tools and take ultimate responsibility for the quality and accuracy of the recurrence check. This approach, using a framework of assistive intelligence, will give us the necessary data to verify that the intelligence within the applications is operating as expected, while guaranteeing the continued stability and oversight required of a regulated process.
The AI team consisted of five internal employees (Kriti, Pascal, Tobias, Dr Jonas and me) and one external consultant, all of whom had varied development experience and deep AI or data science experience. This team worked with the project team members to build the necessary data models, and then integrate them into our Quality application platform. In the end, we spent more than 4,000 hours building approximately 30 different models to deliver the clustering and similarity results that most closely aligned with the opinions of the subject matter experts. Ultimately, we found that HDBSCAN with a TF-IDF vectorizer and special preprocessing returned the best results for the TT. We were able to demonstrate that TF-IDF even outperformed most of the newer, transformer-based models. Note that one BERT-based model was shown to match TF-IDF's vectorization performance, but was computationally more expensive. On the other hand, we found that a fine-tuned version of the MS MARCO embedding model provided the best vectorization output for calculating semantic similarity in the RCT. A serendipitous side effect of this 'four eyes' approach is that we cast a slightly wider net, achieving better results together than would be provided by using just one embedding model for both the TT and RCT.
We employ two different embedding models: TF-IDF and MS MARCO. MS MARCO is a neural network-based, semantic embedding model that produces dense vectors, while TF-IDF is a lexical, frequency-based algorithm that simply uses word frequencies to calculate sparse embedding vectors. The inexorable march of technology would entice one to believe that MS MARCO should be superior to TF-IDF, especially considering the MS MARCO model was fine-tuned on record text to further learn semantic associations within the records' information. Nonetheless, TF-IDF proved to be a worthy weapon in our toolbox. We attribute its performance partially to the preprocessing we performed on the record text (e.g., stemming and lemmatization), tailored to the type of text in this domain. While we need to do more testing to verify or refute our hypothesis, we believe that TF-IDF has an advantage over open-source, pretrained embedding models when the record text to embed contains proprietary data such as product names, which would not likely appear in an open-source embedding model's pretraining corpus. Because TF-IDF does not 'care' about the 'meanings' of such words, but rather focuses on their frequencies, it is arguably better able to handle these kinds of proprietary words.
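As a small, hypothetical illustration of that point, the snippet below contrasts how a pretrained transformer tokenizer and a TF-IDF vocabulary treat a made-up product name. 'Xyzalvex' and the checkpoint name are placeholders for illustration, not terms or models from our data.

```python
from transformers import AutoTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Deviation during Xyzalvex filling run",
    "Xyzalvex batch released after label correction",
    "Cleaning deviation in packaging area",
]

# A pretrained tokenizer has never seen the proprietary name, so it falls back
# to generic subword pieces, diluting whatever meaning the model can attach to it.
tok = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-v4")
print(tok.tokenize("Xyzalvex"))       # several subword fragments, not one token

# TF-IDF simply treats the name as a single term whose weight reflects how
# distinctive it is within the corpus.
vec = TfidfVectorizer(lowercase=True)
vec.fit(docs)
print("xyzalvex" in vec.vocabulary_)  # True: kept as one weighted feature
```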
As mentioned, the TT and RCT were designed to be used as part of a regulated process, which meant that the tools needed to be validated. While there is guidance from the various regulatory agencies on how to build, use and maintain AI applications in a regulated space, the prevailing regulations still lean heavily on 'classic validation' techniques, which prove a function's reliability through repeated demonstration that entering a specific input yields a known, expected output. That paradigm has worked wonders over the past decades, but it doesn't naturally apply to situations in which the output is not known, or where valid variations among the 'correct' answers could exist. For example, imagine you gave a list of 20 popular, temporally diverse books to a group of friends, and asked them to sort the books into two groups: Classics and non-Classics. Unless your list consists of timeless, universally loved masterpieces along with absolute dreck, it is incredibly unlikely that your friends will generate the exact same groupings. It's even more unlikely if you ask them to group the books into lists by 'type.' Let's assume there is at least some agreement. Are some friends right, and the rest unread rubes? Perhaps. I don't know your friends. But it's more likely that the subjective nature of the task leaves room for varying degrees of correctness.
More data about the things to be clustered may not necessarily lead to universal cluster agreement, either. Every year as the weather turns cold, American college football fans turn their attention to the annual ritual of sorting college football teams into two groups: those who have earned the right to play for the national championship, and those who get to watch it at home. Despite having every datapoint that could ever conceivably lead to the objective determination of the best four teams in the country, there are as many clusterings as there are people opining. In fact, even the experts, whose entire profession revolves around this singular purpose, rarely agree.
These contentions demonstrate the inherent difficulty in applying classic validation to an AI-enabled application, whose purpose may very well be to deliver the answers we couldn't calculate or anticipate on our own. Indeed, even relatively prescriptive AI such as HDBSCAN and TF-IDF depends on the complete set of datapoints to calculate the output. As the ebbs and flows of data push semantic associations one way one month and the other way the next, one college football team may be worthy of the playoffs today but may be pushed out tomorrow when another team's résumé shines slightly brighter in the light of another day. This shift may not come through any of the team's actions, but rather result from the gravitational forces exerted through the actions of the other teams.

For this reason, we employed a validation approach that recognised the inherent differences between a normal, deterministic operation and an intelligent one, staying faithful to both the letter and spirit of the validation process, while demonstrating the operation's correctness despite not knowing exactly what the output would be. We did this primarily by demonstrating longitudinal stability through human review. Specifically, we created 12 test sets, slicing the data into 2-year windows that progressively slid over the dataset, greeting the new datapoints while forgetting the old. Each slice of data was reviewed to determine how much change occurred between and within the groupings, and whether the groups were correct. We established an accepted threshold for the variation in size of the main clusters and showed that the composition of these clusters was always within the threshold, thereby demonstrating stability. More importantly, a human expert judged each slice's composition and certified that the clusters were consistent with the expected quality of human clustering.
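The sketch below shows the general shape of that longitudinal check under stated assumptions: records carry a date and a text column, each 2-year slice is re-clustered with the same hypothetical TF-IDF + HDBSCAN pipeline sketched earlier, and stability is judged by how much the relative sizes of the largest clusters drift between consecutive slices. The column names, window arithmetic and 20% threshold are placeholders; in our actual validation, a human expert also reviewed and certified each slice's composition.

```python
import pandas as pd
import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_size_profile(texts):
    """Cluster one slice of record texts and return the relative sizes of the
    non-noise clusters, largest first (labels are not matched across slices)."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
    labels = pd.Series(hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(X))
    sizes = labels[labels != -1].value_counts(normalize=True)
    return sizes.sort_values(ascending=False).reset_index(drop=True)

def stability_check(df, n_slices=12, window_years=2, threshold=0.20):
    """Slide a 2-year window over the data, re-cluster each slice, and flag any
    slice whose main clusters change in relative size by more than the threshold."""
    start, end = df["date"].min(), df["date"].max()
    step = (end - start) / (n_slices + 1)          # spacing of the window starts
    window = pd.DateOffset(years=window_years)
    previous = None
    for i in range(n_slices):
        lo = start + i * step
        profile = cluster_size_profile(df.loc[df["date"].between(lo, lo + window), "text"])
        if previous is not None:
            drift = profile.sub(previous, fill_value=0).abs().max()
            print(f"slice {i}: max cluster-size drift = {drift:.2f} "
                  f"({'within threshold' if drift <= threshold else 'needs review'})")
        previous = profile
```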
Having successfully released our company's first GxP AI application into production, we have set our sights on the future, and intend to deliver three important updates. First and foremost, we are working with human experts to understand how the clustering and semantic search operations can be improved. This may include offering multiple clustering operations (e.g., one for big sites and one for smaller sites) or performing the clustering operation with multiple models and algorithms and using a mixture-of-experts approach to return the optimal results.
Similarly, we are considering our opportunities for continuously fine-tuning our MS MARCO embedding model with new and extended data. We are particularly excited about the possibility of further fine-tuning the MS MARCO model with human feedback using triplet loss, a loss function that takes an anchor together with a positive and a negative example and adjusts the embedding space so that the anchor moves closer to the positive example and further from the negative one. Fortunately, we already have the data to perform this operation without human intervention: we know the records that the human experts indicated were related (the ones they chose) and the records that the human experts felt were not related (the ones they were recommended but did not choose). With this information we can automatically incorporate triplet loss to improve our model's performance.
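A sketch of that fine-tuning loop, using the TripletLoss implementation from the sentence-transformers library, might look like the following. The public MS MARCO checkpoint, the example texts, and the training parameters are placeholders; in our setting the anchor would be the new record, the positive the related record the expert selected, and the negative a recommended record the expert did not choose.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Public checkpoint standing in for the internally fine-tuned MS MARCO model.
model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v4")

# Each triplet: (anchor = new record, positive = record the expert accepted as
# related, negative = recommended record the expert rejected).
train_examples = [
    InputExample(texts=[
        "Cold room temperature exceeded limit during transfer",   # anchor
        "Temperature excursion observed in cold storage unit",    # positive
        "Label misprint detected on packaging line",              # negative
    ]),
    # ... more triplets harvested automatically from expert selections ...
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# TripletLoss pulls each anchor towards its positive example and pushes it away
# from its negative example by at least a margin in the embedding space.
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```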
Finally, we are investigating the possibility of using tools and techniques like density-based clustering validation (DBCV) and other validation metrics to further demonstrate stability over time. We believe that through consistency and both system-based and human-based oversight, we can automate the process of updating and validating our models, delivering increasing performance over time.
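One candidate for such a metric is the DBCV implementation that ships with the hdbscan package; a minimal sketch of how it could score a clustered slice is shown below. The synthetic data are purely illustrative, and whether a DBCV threshold alone is a sufficient release criterion is precisely what we still need to establish.

```python
import numpy as np
from sklearn.datasets import make_blobs
import hdbscan
from hdbscan.validity import validity_index

# Synthetic, clusterable data standing in for a slice of record embeddings.
X, _ = make_blobs(n_samples=300, centers=4, n_features=20, random_state=0)
X = X.astype(np.float64)  # validity_index expects double-precision input

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)

# DBCV rewards dense, well-separated clusters; scores fall roughly in [-1, 1].
score = validity_index(X, labels)
print(f"DBCV score for this slice: {score:.3f}")
```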
Creating and releasing Boehringer’s first fully validated, AI-enabled application in the GxP space was a challenging and rewarding experience. Our team is delighted that we could deliver such a valuable, cutting-edge application, which has already delivered numerous insights and time savings for the company, and more importantly, has helped us better supply our patients – both humans and animals – with the medicines they need, when they need them. Looking ahead, we are excited about the significant advancements we can make and the similar applications we can develop, using the TT and RCT as models. Pun intended.