Biobase: Revolutionising Biopharmaceutical Data Integration and Decision-Making for Boehringer Ingelheim By Bettina Knapp

Bettina Knapp studied bioinformatics and earned a doctorate in computer science from Heidelberg University. In 2015, she started work at Boehringer Ingelheim as a team leader in biostatistics in Biopharmaceutical Manufacturing, and later as a principal statistician. Since 2018, Bettina has been leading a laboratory in cell culture development, and in February 2024 she took over the role of Product Owner in a scrum team to develop Biobase.
In this post, Bettina Knapp gives us an insight into the creation of Biobase, an innovative web application which has transformed biopharmaceutical development and production at Boehringer Ingelheim. Biobase integrates biologics development and manufacturing data to create a high-quality FAIR data landscape. Benefits of the application include enhanced data quality and consistency, and seamless data visualisation and analysis:

INTRODUCTION TO BIOBASE

Boehringer Ingelheim is a family-owned pharmaceutical company founded in 1885 in Ingelheim, Germany. We focus on human pharma and animal health, and have research and development as well as commercial manufacturing sites all over the world.

Biobase is a web application developed internally at Boehringer Ingelheim for use within biopharmaceutical development and production. It provides access to a mapped and harmonised dataset of biopharmaceutical development and manufacturing data. The application was developed by an agile scrum team and uses commercial software components to run the pipeline and store the data, mainly within a Boehringer Ingelheim environment that runs on tools in the AWS (Amazon Web Services) cloud. The Biobase team consists of five developers, one scrum master, four subject matter experts (SMEs) and one product owner. The stakeholders and users of Biobase come from different departments and areas within Biopharma of Boehringer Ingelheim itself, and from other sources around the world.

The mission of Biobase is the integration of biologics development and manufacturing data to create a high-quality FAIR data landscape. Biobase aims to enable fast data and model-driven decision-making and digital innovation for biologicals development and manufacturing.

The training material for Biobase users consists of short walkthrough videos of around 5-10 minutes, documentation, and in-person training sessions given by the Biobase team.

Benefits of the Biobase approach include enhanced data quality and consistency, as well as easier, automated data visualisation and analysis.

MAJOR USE CASES

The development of Biobase is motivated by the following use cases:

Data-sharing: Data is shared and discussed with peers directly within Biobase and there is no need for email, download or Excel spreadsheet sharing.
Reporting and visualisation: The Biobase user interface allows users to select, filter, and display the data needed for reporting. An export in PDF, XLSX or CSV format is possible. Besides the user interface, Biobase allows us to link the data with other visualisation tools (e.g. Spotfire) which gives the user more flexibility to build up specific reporting, visualisation and data modelling requirements.
Data first: The harmonised data within Biobase enables data-driven experimental planning by using data from previous projects for scientific decision-making. It is the basis for any kind of cross-project data analysis such as modelling.
Faster decisions: Biobase allows users to back up data-driven project decisions with cross-project and cross-scale data. Decisions can thereby be made much faster.
Troubleshooting: Biobase can, for example, display the connection of different process steps within biopharmaceutical development, or easily show the information of used media. This enables fast and easy troubleshooting and root-cause analysis in case of problems. Direct correlations across the processes are easily done in Biobase.
Transfer of a process: Data coming from different manufacturing sites is directly available and harmonised in Biobase. This saves time in data collection and reporting.
Submission and filing of a process: Biobase supports rapid submission and filing due to facilitated data collection and reporting options.
Automated modelling: Being able to access and programmatically process Biobase data (Python/R/SQL/etc.) without the need to export it allows us to easily embed the data in modelling workflows. Due to the harmonisation of the data, scaling up workflows, for example from one project to many projects, is straightforward with Biobase. Sophisticated analyses thereby become possible and support the transition to in-silico development. Cloud computing resources can also be leveraged.
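
The automated modelling use case can be illustrated with a minimal Python sketch. It is not the Biobase interface, which is not described in this article. It assumes a hypothetical harmonised table called harmonised_results with the columns project, scale_l and titer_g_per_l, held in an in-memory SQLite database as a stand-in for the real data store. Because every project uses the same harmonised column names, the same query works for one project or for many.

```python
import sqlite3

# Hypothetical stand-in for a harmonised results table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE harmonised_results "
    "(project TEXT, scale_l REAL, titer_g_per_l REAL)"
)
conn.executemany(
    "INSERT INTO harmonised_results VALUES (?, ?, ?)",
    [
        ("project_a", 2.0, 3.1),
        ("project_a", 2.0, 3.3),
        ("project_b", 2000.0, 2.8),
        ("project_b", 2000.0, 3.0),
    ],
)


def mean_titer_by_project(connection, projects):
    """Return {project: mean titer} for any number of projects."""
    placeholders = ", ".join("?" for _ in projects)
    query = (
        "SELECT project, AVG(titer_g_per_l) FROM harmonised_results "
        f"WHERE project IN ({placeholders}) GROUP BY project"
    )
    return dict(connection.execute(query, list(projects)).fetchall())


# The same function serves a single project and a cross-project comparison.
single_project = mean_titer_by_project(conn, ["project_a"])
many_projects = mean_titer_by_project(conn, ["project_a", "project_b"])
```

The point of the sketch is that no per-project column mapping is needed in the analysis code once the data is harmonised upstream.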

VALUE PROPOSITION

The value proposition of Biobase spans various aspects of the biopharmaceutical development process. Firstly, it accelerates pipeline impact, affecting each biopharmaceutical product from the start of development through to submission to health authorities, and even during the commercial production phase of a product. Specific use cases, especially at interfaces during transfers between different sites, have the most significant impact. In instances of troubleshooting, the ability to instantly access data is crucial, making Biobase integral to different concepts and necessary for fast drug development.

Secondly, Biobase contributes to cost efficiency and savings. Almost every workflow requires data for reporting, team discussions, and much more. In terms of operational excellence, Biobase ensures that all workflows use a single source of truth for data. The workflow requirements are designed and implemented by SMEs and key users, enhancing customer centricity.

From a security and compliance perspective, Biobase operates at a non-GxP (GxP = good practice quality guidelines and regulation within pharmaceutical and food industry) compliance level, with full technical implementation of data integrity measures. Data security and compliance are ensured by utilising Boehringer’s infrastructure and technology.

In the realm of sustainability, environmental, and social considerations, Biobase’s ability to find and analyse historical data avoids the need for repeating similar experiments, fostering a data-first approach. This also aids in shifting mindsets by leveraging synergies across different departments.

Finally, for business continuity, it is imperative to establish data roles and responsibilities for data standards to ensure a continuous run mode for Biobase. As more and more workflows rely on Biobase data, the need for business continuity measures built into the data pipeline becomes increasingly important.

BIGGEST CHALLENGES AND HURDLES

The biggest challenges within the development of Biobase are:

Heterogeneous landscape of data sources ranging from pure Excel tables to diverse kinds of databases.
Different ‘vocabulary’ of experts as well as often undefined naming conventions along the process chain in bioprocess development and along the bioprocess itself.
Cultural barriers: digitalisation should not be seen as a different discipline but should be part of everyone’s work on all levels. It is the basis of every model generation and data interpretation, but is often tedious and needs an investment.

How to overcome these challenges is discussed in the following chapters.

TECHNICAL BACKGROUND OF BIOBASE TO COMBINE DIFFERENT DATA SOURCES

The data coming from the very heterogeneous landscape of data sources is collected within Biobase. Several automated steps of data pre-processing and data harmonisation exist, as well as data linkage tools to ensure consistent data reporting and scientifically sound representation of data categories (see Figure 1).

The data can be shown as given in the source systems to reproduce the original presentation, or in a harmonised manner which allows an immediate reporting and analysis of data coming from different sources.

During the data processing, Biobase ensures documentation and traceability of the process by using the following tools and features:

Figure 1

Data processing pipeline. Raw data collection from the different source systems is followed by data preprocessing, data harmonisation and data linkage, and finally results in a standardised and consistent set of data stored in the Biobase master data.

Technical tools:

Bitbucket: Version control, track changes in configurations
Jira/Confluence: Critical Biobase tables and columns are well documented in Confluence pages, showcasing the concept and specifications, contributing to the credibility of the data
Backend tools: Automated pipeline monitoring, data drift alerts
CI/CD: E2E function testing on unit features, data quality gates

Procedures:

Microsoft Teams: bug reporting and user feedback
SME community: collaboration for data inspection and verification

Biobase employs a data lineage interactive notebook to represent the flow of data, simplifying the understanding of complex data journeys (see Figure 2). Data quality checks are performed during harmonisation and then reported in so-called data drift reports. In more detail, the basic principle of ensuring data lineage is to apply a unique hash value to each ingested raw data point that will always be kept throughout the data journey. As a result, it is technically possible to trace each data value in the results tables and graphs back to the raw data point that was originally ingested.
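
The hash-based lineage principle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Biobase implementation: the function lineage_hash, the field names, and the choice to derive the hash from the identifying fields (source system, record identifier, field name) using SHA-256 are all hypothetical.

```python
import hashlib


def lineage_hash(source_system, record_id, field):
    """Derive a stable hash from the fields that identify a raw data point.

    Assumption: the hash is built from identifying fields only, so it stays
    the same for the whole data journey.
    """
    key = "\x1f".join((source_system, record_id, field))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical raw data points from two different source systems.
raw_points = [
    {"source_system": "lims", "record_id": "run-001", "field": "titer",
     "value": 3.14159, "unit": "g/L"},
    {"source_system": "excel", "record_id": "sheet7-row12", "field": "Titre",
     "value": 2.71828, "unit": "g/L"},
]

# Ingestion: every raw point is stored under its hash, unaltered.
raw_store = {
    lineage_hash(p["source_system"], p["record_id"], p["field"]): p
    for p in raw_points
}

# Results table: values may be rounded for display, but each row keeps the hash.
results = [
    {"lineage_hash": h, "parameter": "titer",
     "display_value": round(p["value"], 2)}
    for h, p in raw_store.items()
]


def trace_back(result_row):
    """Return the original raw data point behind a results row."""
    return raw_store[result_row["lineage_hash"]]
```

Calling trace_back on any results row returns the untouched raw record, including the value with its original number of digits.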

Figure 2

Data journey. Raw data tables are transferred to intermediate tables where rigorous data validation checks and quality assessments are performed and reported (data drift reports). Final tables can be tracked back to the source system via hash values.

BIOBASE DATA HARMONISATION TO MAKE A JOINT DATA ‘VOCABULARY’

To overcome the challenge of having different vocabulary within biologicals development, Biobase harmonises the data coming from different data sources. Biobase uses a harmonisation tool which provides a clear and complete view of mapping rules defined by the SMEs, further enhancing transparency and traceability throughout the data transformation process. The developers of Biobase have a specific account type that allows them to influence the harmonisation rules for the respective data pool they are working on. The release and the changes are documented. The users of the Biobase productive system do not have access to the harmonisation rules as such. The standard user cannot change the data pipeline and cannot change any data in the source systems. The Biobase team does not change any raw data value as the data is ingested from a data warehouse only. An audit trail records data transformation, source details, and any changes made during integration. Version control is implemented to track modifications and ensure transparency.
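
As a rough illustration of what SME-defined mapping rules might look like, the following Python sketch maps source-specific parameter names to one harmonised term and converts units, while keeping the original name, value and unit alongside the harmonised ones. The rule tables, parameter names and conversion factors are hypothetical examples, not actual Biobase rules.

```python
# Hypothetical mapping rules of the kind SMEs might define.
NAME_RULES = {
    "Titre": "titer",
    "titer": "titer",
    "product conc.": "titer",
}
TARGET_UNITS = {"titer": "g/L"}
UNIT_FACTORS = {
    ("mg/L", "g/L"): 0.001,
    ("g/L", "g/L"): 1.0,
}


def harmonise(record):
    """Apply the name and unit rules to one raw record.

    An unmapped name or unit raises KeyError, so gaps in the rules are
    surfaced instead of being guessed.
    """
    parameter = NAME_RULES[record["name"]]
    target_unit = TARGET_UNITS[parameter]
    factor = UNIT_FACTORS[(record["unit"], target_unit)]
    return {
        "parameter": parameter,
        "value": record["value"] * factor,
        "unit": target_unit,
        # The raw entry is carried along unchanged for traceability.
        "original_name": record["name"],
        "original_value": record["value"],
        "original_unit": record["unit"],
    }


harmonised = harmonise({"name": "Titre", "value": 2900.0, "unit": "mg/L"})
```

Keeping the rules as plain data, separate from the code that applies them, is one way to make them reviewable and easy to place under version control.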

With these measures, the segregation of duties is assured (see Table 1) and the workflow of the data harmonisation is separated into the two Biobase environments, i.e. development (Dev) and productive (Prod) environment:

Biobase Dev Environment – Harmonisation rules are settled, defined and verified
Biobase Dev Environment – Data testing by SMEs
Biobase Prod Environment – Documented implementation and release

Table 1

Segregation of duties

ROLE – RIGHTS

Standard user – Read-only, no access to harmonisation rules

Biobase SME – Harmonisation and verification of harmonisation rules

Developer – Generation of harmonisation rules in (Biobase Dev)

Administrator – Releases new features and harmonisation rules into (Biobase Prod)

The following data-sharing terms are given in Biobase:

Purpose of Data-Sharing: The purpose of data-sharing is to have a combined platform for data access across different source systems and is purely non-GxP, independent of the GxP status of the source systems.
Data Ownership: The data is owned by the user of Biobase, who takes full responsibility for data correctness and completeness.
Data Privacy and Security: The data is not protected or private, i.e. a user of Biobase can see all data but cannot change the data.
Data Access: All employees from Boehringer Ingelheim could in theory have access after successfully accomplishing the necessary training.
Data Usage: The data shall be used solely for non-GxP purposes.
Data Retention: The data is available in Biobase until the next update from the source as it reflects the status of the source systems (updates are usually done within 24 hours).
Data Quality and Integrity: The data is shared as given in the source systems or in a harmonised manner.

Biobase adheres to the application of the diligence measures of ALCOA principles, in the source systems and in good scientific practice:

Attributable: Biobase can trace back each data point to the original data entry into the respective source system.
Legible: The data represented in Biobase is sorted according to the same criteria as in the source systems and enriched with process context, which should improve the legibility according to the scientific standards applied.
Contemporaneous: The pipeline of Biobase runs once per night automatically. Time and date stamps of the original raw data are copied and not altered by this process. However, Biobase provides additional control over changes in the source system, as data with the same hash value can be compared day by day. A data drift or change can therefore be detected automatically (data drift detection). As a side note: data drift detection also helps to identify potential errors in the data pipeline and serves as an additional safety feature.
Original: There is an additional safety margin for originality built into the Biobase system. The need to generate uncontrolled copies of the data, e.g. in Excel spreadsheets, is minimised, as you always have access to the original data. Despite Biobase being a copy of the raw data, in fact the introduction of hash values for each data point makes it always possible to assign the data to the right raw data value.
Accurate: As far as the scientific accuracy is concerned, the numerical value of the processed data is not at all altered in Biobase. There is only a harmonisation taking place to make sure that all the units and method descriptions are properly used, and in this respect, the data could be equipped with additional metadata. In case of the unit harmonisation and yield calculation, there is always the option to go back to the original value with a potentially different unit used in the raw data. Accuracy is often also defined as using the right number of digits and rounding principles. Therefore, rounding and truncation of the data is only performed on the copy (e.g. for better representation of the diagram). In the background, behind the respective hash value, there is always the value with the original number of digits unaltered.
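
The day-by-day comparison described under "Contemporaneous" can be sketched as follows. This is a simplified illustration, assuming each nightly snapshot is available as a mapping from a data point's hash to its value. The function name detect_drift and the short placeholder hashes are hypothetical.

```python
def detect_drift(previous, current):
    """Compare two nightly snapshots keyed by hash value.

    Returns the points whose value changed, the points that disappeared,
    and the points that are new in the current snapshot.
    """
    shared = previous.keys() & current.keys()
    changed = {
        h: (previous[h], current[h])
        for h in shared
        if previous[h] != current[h]
    }
    return {
        "changed": changed,
        "missing": sorted(previous.keys() - current.keys()),
        "added": sorted(current.keys() - previous.keys()),
    }


# Placeholder hashes stand in for the real per-data-point hash values.
snapshot_monday = {"a1f3": 3.14, "b2e4": 2.90, "c3d5": 7.10}
snapshot_tuesday = {"a1f3": 3.14, "b2e4": 2.95, "d4c6": 1.20}

report = detect_drift(snapshot_monday, snapshot_tuesday)
```

A non-empty report may indicate either a genuine change in a source system or an error in the pipeline, which is why such a check can also serve as a safety feature.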

In Biobase, all data is processed, as there is no pre-selection of data occurring. Biobase resides in an IT-controlled environment with backup and lifetime concept, so the data in Biobase will be protected from any loss or copy-paste error which can arise in manual data processing.

Finally, the availability and shareability of Biobase analysis makes a peer-to-peer review of findings much more transparent, and one can always refer to the dataset used in the analysis directly on Biobase.

HOW TO HANDLE CULTURAL CONCERNS

Digitalisation is not a different discipline (such as IT) but should be accessible to everyone on all hierarchical levels. It is well known that a datapoint without the necessary metadata is more or less useless in modelling approaches. Thus, not only the amount of data matters, but also the description of the data, to overcome the inconsistency problem which often occurs in data analysis. Biobase has proven in many use cases to be the solution to this problem: having good, documented, harmonised data is worth the investment! Besides the allocation of budget, efficient planning of resources and prioritisation of different tasks is most important to achieve cross-functional and global stakeholder management during the Biobase development. A constant recruitment of new users and use cases is essential to meet and align to the needs of the whole biologicals organisation.

SUMMARY

Biobase is an internally developed web application by Boehringer Ingelheim that provides access to a harmonised dataset of biological development and manufacturing data. The application, developed by a team of developers, SMEs, a scrum master, and a product owner, aims to integrate biologics data to create a high-quality FAIR data landscape.

The application serves various departments within Biopharma of Boehringer Ingelheim worldwide, and offers benefits such as enhanced data quality, consistency, and credibility. Major use cases include data-sharing, reporting and visualisation, data-driven experimental planning, faster decision-making, troubleshooting, data transfer, submission and filing, and automated modelling.

Despite the heterogeneous landscape of data sources and the cultural concerns related to digitalisation, Biobase manages to combine different data sources through a series of automated data pre-processing, harmonisation, and linkage steps. The application ensures data lineage and transparency through robust methods and well-documented specifications.

Biobase harmonises data to create a joint data ‘vocabulary’ and maintains segregation of duties to ensure data integrity. The data-sharing terms are designed to ensure data ownership, privacy, security, access, usage, retention, quality, and integrity.

The application adheres to the ALCOA principles, ensuring that each data point can be traced back to the original data entry, and is legible, contemporaneous, original, and accurate. Biobase emphasises that digitalisation should be part of everyone’s work, not a separate discipline, and that the value of data lies not only in its quantity but also in its description. With that, we are very pleased to pave the way for any data analysis tools such as AI or machine learning to be used by everyone within Boehringer Ingelheim Biopharma in an easy and user-friendly manner.