As energy prices continue to skyrocket in the UK, the impact is felt most acutely by households already under financial strain, despite the government’s efforts to stem the tide with measures such as the energy price guarantee. In response to this pressing issue, energy suppliers are rising to the occasion, expanding their support initiatives for customers and making strides to ensure this assistance reaches as many eligible households as possible.
EDF had already increased the support that they could offer to customers throughout 2022. This included doubling the financial assistance for struggling customers to £10m, partnering with charities such as Income Max to help customers access the financial support they are entitled to, and offering £20m of funding for home insulation. However, EDF were keen to ensure that this support reached the customers who needed it most.
To help solve this challenge, the data science team at EDF partnered with business stakeholders and used unsupervised machine learning and data from smart prepayment meters to proactively identify financially vulnerable customers. The development of this model, and the need to bring it into production quickly and at scale, also led to a whole new Machine Learning Model Operationalisation Management (MLOps) solution for EDF. In this article, I will provide a detailed look at how this algorithm was developed and how this project shaped a new path to production for the data science team.
The Approach
The data science team at EDF takes an agile approach to tackling business problems with data science. At the core of this is the principle of working closely with business end users and domain experts from the first day of a project. The final data science solution is shaped not only by the Data Scientists but also by the business stakeholders, ensuring an actionable product is delivered.
Data science solutions are also developed iteratively with the team aiming to produce a lean but usable solution and to deploy to production quickly so that the end users can trial the output and generate a feedback loop for future improvements.
Using this approach, the development of a financial vulnerability detection algorithm started with a series of workshops with business users and Data Scientists. The outcome of these workshops was a target set of features that were intuitively likely to indicate financial vulnerability, and an initial solution in the form of a simple clustering model to identify the targeted customer segments.
The Data
There are several methods available for customers to pay for the energy that they use. One of these is known as Pay as You Go (PAYG). Customers using this payment method have meters that they pay in advance to use, with each advance payment commonly referred to as a top-up. The data available from PAYG smart meters is likely to contain markers of financial vulnerability, such as periods of time with no credit, also known as self-disconnection, or erratic top-up patterns. For this reason, features derived from the PAYG smart meter data were selected for the first iteration of the model.
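To make this concrete, below is a minimal pandas sketch of the kind of feature engineering this implies. The file and column names (payg_topups.parquet, customer_id, topup_date, topup_amount, self_disconnections) are hypothetical stand-ins for the real PAYG smart meter fields, not EDF's actual schema.

```python
import pandas as pd

# Hypothetical extract of PAYG top-up events: one row per top-up per customer.
topups = pd.read_parquet("payg_topups.parquet")
topups["topup_date"] = pd.to_datetime(topups["topup_date"])

# Aggregate to one row per customer, capturing top-up frequency, amount variability
# (erratic top-up patterns) and the number of self-disconnection events.
features = (
    topups.sort_values("topup_date")
          .groupby("customer_id")
          .agg(
              topup_count=("topup_amount", "size"),
              mean_topup_amount=("topup_amount", "mean"),
              topup_amount_std=("topup_amount", "std"),
              mean_days_between_topups=("topup_date", lambda s: s.diff().dt.days.mean()),
              self_disconnections=("self_disconnections", "sum"),
          )
          .fillna(0)
)
```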
Another important data point available to the team was the Priority Services Register (PSR) flag. The PSR is an industry-wide service where eligible customers can register to access additional support from energy providers. The PSR flag records information about a wide range of vulnerabilities, including financial vulnerability. As the PSR flag relies on proactive registration, it is an imperfect indicator and therefore not accurate enough to act as a ground truth for a supervised learning solution. It would, however, later prove invaluable for validating the final model, as I will cover in more detail later.
The Model
As an accurate ground truth label was unavailable for identifying financially vulnerable customer groups an unsupervised learning approach was taken in the form of clustering.
Clustering algorithms use methods to identify similarities between data points and to generate clusters of similar observations. The goal is to generate distinct clusters where the differences between observations within the clusters are minimised whilst the distinction between different clusters is maximised. A clustering algorithm was applied to the PAYG smart meter features to determine if there was a meaningful segmentation of customers.
The team experimented with several clustering algorithms on different feature sets. The results of these experiments led to the selection of a K-means model trained on just five features. As previously mentioned, the team's bias at this stage of the project was to deliver a simple working solution for end-user testing and feedback. K-means, being a simple and interpretable algorithm, fitted this remit.
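As an illustration, the core of such a model can be expressed in a few lines of scikit-learn. This is a sketch only: features is the customer-level table from the earlier example, and the five column names are hypothetical rather than the actual features chosen by the team.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical selection of five customer-level features.
five_features = features[[
    "topup_count",
    "mean_topup_amount",
    "topup_amount_std",
    "mean_days_between_topups",
    "self_disconnections",
]]

# K-means is distance-based, so the features are standardised before clustering.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
cluster_labels = model.fit_predict(five_features)
```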
How Do You Validate Unsupervised Models?
The performance of unsupervised machine learning tasks, where ground truth labels are unavailable, is not as easy to measure as that of supervised problems, where we can count errors and calculate standard metrics such as precision and recall. Where no ground truth is available, evaluation metrics assess whether observations within the same cluster are more similar to each other than to observations in other clusters; the silhouette score is one example. However, such metrics do not tell us whether a target segment has been identified correctly, which is exactly the question when detecting financially vulnerable customers.
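For reference, this is how such an internal metric might be computed with scikit-learn, assuming the pipeline from the previous sketch:

```python
from sklearn.metrics import silhouette_score

# Internal validation only: a higher silhouette score indicates more compact and better
# separated clusters, but it cannot confirm that the financially vulnerable segment has
# actually been found.
scaled = model.named_steps["standardscaler"].transform(five_features)
print(f"Silhouette score: {silhouette_score(scaled, cluster_labels):.3f}")
```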
The data science team at EDF tackled this problem in two ways.
The first involved performing validation using the PSR flag. Although not a perfect ground truth label, the assumption could be made that a financially vulnerable cluster of customers would contain a higher concentration of observations carrying this flag.
Analysing the three clusters, the data science team observed a noticeable difference in the prevalence of the PSR flag. This initial piece of analysis provided a strong indicator that at least two distinct clusters had been identified. These were provisionally labelled as financially vulnerable and financially secure.
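A check along these lines could look like the sketch below, where psr_flag is assumed to be a boolean pandas Series aligned to the same customers as the feature table:

```python
# Hypothetical validation of cluster labels against the PSR flag.
validation = five_features.assign(cluster=cluster_labels, psr_flag=psr_flag)
psr_rate_by_cluster = validation.groupby("cluster")["psr_flag"].mean()
print(psr_rate_by_cluster)  # a markedly higher rate points to the financially vulnerable cluster
```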
To further validate that customers were being correctly identified, the data science team performed customer-level analysis and worked with the business experts. One important piece of analysis looked at the top-up behaviour of individual customers over time, plotted against the number of self-disconnections.
The images below show two examples of customers. On the left is one from the financially vulnerable segment. Here we can observe an erratic pattern of top-up behaviour and several self-disconnections suggesting financial vulnerability. The chart on the right shows a customer from the segment that was labelled as financially secure. In this case, top-up behaviour is very regular and stable, and no self-disconnections are present.
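A per-customer view like this can be put together with a few lines of matplotlib. The sketch below is illustrative: customer_topups is a hypothetical slice of the top-up events for a single customer and disconnection_dates a list of dates on which that customer ran out of credit.

```python
import matplotlib.pyplot as plt

def plot_topup_history(customer_topups, disconnection_dates):
    # Plot top-up amounts over time, marking each self-disconnection with a vertical line.
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(customer_topups["topup_date"], customer_topups["topup_amount"],
            marker="o", label="Top-up amount (£)")
    for d in disconnection_dates:
        ax.axvline(d, color="red", linestyle="--", alpha=0.5)  # self-disconnection event
    ax.set_xlabel("Date")
    ax.set_ylabel("Top-up amount (£)")
    ax.set_title("Top-up history with self-disconnections")
    ax.legend()
    return fig
```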
Deploying to Production
Once an initial solution to the business problem had been developed the next step was to deploy this model into production so that it could be used in initial trials to proactively reach greater numbers of customers with support.
The existing route to production for data science models was slow and cumbersome. As is still common with many MLOps platforms, the workflow was fragmented: it consisted of numerous tools and a large amount of infrastructure, all built and managed in-house.
The data platform team had recently completed a migration of the company-wide data lake into Snowflake, a cloud-based platform for large-scale data storage, processing and analytics. The existing workflow required data to be moved from Snowflake into Amazon S3 buckets, where the team could make use of Apache Spark running on Amazon EMR clusters to prepare data for machine learning.
This fragmented workflow, the reliance on external teams, and the need to manage several different tools and a large amount of infrastructure meant that the path to production was slow and gave the data science team little autonomy to deploy their own models.
The need to operationalise the financial vulnerability detection algorithm at speed prompted a complete re-think of the data science platforms for EDF and the team spent time assessing numerous third-party tools and looking for a solution. After several planning sessions, a cross-functional product team was formed consisting of DevOps engineers, Data Engineers, Data Scientists and MLOps engineers and the team began a proof of concept for a new MLOps platform.
Considerations for the Platform
Development of the platform took an agile approach with the new product team aiming to develop a minimum viable product (MVP) within three months. The MVP platform had several requirements necessary to meet the needs of deploying machine learning models and data transformations.
Data pipelines – The data science team at EDF often works with exceptionally large datasets. For example, the team commonly uses smart meter data consisting of billions of rows of half-hourly meter readings. It was therefore essential for data scientists to have access to a tool for big data transformation. Additionally, the security of data was paramount, so minimising the need to move data for processing was key.
Development of new models – As well as deploying existing models to production, the data science team also needed to develop new data science models. Hosted Jupyter Notebooks with large-scale compute behind them, alongside secure access to the Snowflake data lake, were crucial for this work.
Deployment – At the time, the data science team's models only required batch deployment, so the MVP platform only needed to support the pre-computation of inferences on a regular schedule rather than deploying the models as endpoints.
These requirements, plus the existing technology used within EDF, led to the selection of two primary tools for the core of the data science platform.
Snowpark Python
Snowpark Python is a Python API for querying and processing data stored in the Snowflake data cloud. The API has a syntax similar to PySpark's and enables efficient processing of substantial amounts of data using highly optimised Snowflake compute, in the data science team's language of choice, Python. As the data required by the team was now stored in Snowflake, and as the team had previously been using PySpark to process data, Snowpark was an ideal choice to move to.
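The snippet below is a minimal sketch of what a Snowpark feature pipeline might look like. The connection parameters and table names are placeholders, not EDF's configuration; the key point is that the transformation is written in Python but executed as SQL inside Snowflake, so the data never leaves the platform.

```python
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

# Placeholder connection details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Hypothetical table of PAYG top-up events; the aggregation runs on Snowflake compute.
topups = session.table("PAYG_TOPUPS")
features = (
    topups.group_by("CUSTOMER_ID")
          .agg(
              F.count("TOPUP_AMOUNT").alias("TOPUP_COUNT"),
              F.avg("TOPUP_AMOUNT").alias("MEAN_TOPUP_AMOUNT"),
              F.stddev("TOPUP_AMOUNT").alias("TOPUP_AMOUNT_STD"),
          )
)
features.write.save_as_table("PAYG_FEATURES", mode="overwrite")
```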
When the new data science platform was initially being developed, Snowpark Python was still in private preview as it was a relatively new Snowflake product. The feedback the team were able to provide from the initial migration of existing PySpark data pipelines gave Snowflake the confidence to release the tool into public preview, making it available to all Snowflake customers.
The team no longer needed to move data out of Snowflake and into S3 storage to access EMR clusters. Instead, the Snowpark code could be executed within Snowflake, where the data was already stored. Additionally, as the Snowflake infrastructure is a fully managed service, this resulted in far less maintenance overhead for the team.
Amazon Sagemaker
Snowpark enabled more efficient data access and processing, but the team also needed to develop machine learning pipelines for models. As EDF was already using a large amount of Amazon Web Services (AWS) Cloud infrastructure, Amazon Sagemaker was selected for the machine learning components of the platform.
Sagemaker consists of a fully managed set of tools for developing and deploying machine learning models in the cloud. The EDF data science team primarily made use of two core functionalities. Sagemaker Studio provided hosted Jupyter Notebooks with flexible computing behind them which the team leveraged for the initial exploratory development of models.
Sagemaker Jobs were used to run production workloads for machine learning pipelines. These jobs consist of Python scripts and each job handles a different stage of the machine learning workflow. A processing job, alongside a custom Sagemaker image, enabled the team to execute Snowpark code within Snowflake for the pre-processing parts of the pipeline via the Sagemaker API. A separate training job handles the model training. A batch inference job then generates predictions and writes these back to the Snowflake database.
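As an illustration of the pattern, the sketch below shows how a preprocessing step might be launched as a Sagemaker processing job via the Python SDK. The image URI, IAM role and script name are placeholders rather than EDF's actual configuration; the custom image is assumed to bundle the Snowpark dependencies so that the script can push its transformations down to Snowflake. Training and batch inference jobs follow a similar pattern with their own scripts.

```python
import sagemaker
from sagemaker.processing import ScriptProcessor

# Placeholder role and custom image containing the Snowpark dependencies.
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
image_uri = "<custom-snowpark-image-uri>"

processor = ScriptProcessor(
    role=role,
    image_uri=image_uri,
    command=["python3"],
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker.Session(),
)

# Hypothetical preprocessing script that runs Snowpark transformations inside Snowflake.
processor.run(code="preprocess.py")
```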
The Data Science Platform
A key component of the data science platform was the ability to securely promote new models to production whilst maintaining the reliability of outputs for models already in operation.
To achieve this the product team created four segregated environments each containing a Snowflake database with access to the Snowpark Python API and an AWS account with access to Amazon Sagemaker.
The first environment is called Discovery. This is intended to be a place where the development of new data science products occurs. This environment is centred around Sagemaker Studio Jupyter notebooks for rapid prototyping of new models and data transformations.
Once a product is ready for production, the code begins to be promoted into the next set of environments. The first of these is known as Production Development and is where the production-quality machine learning and data pipelines are built. A further Pre-Production environment is used to test that these new pipelines integrate well with the existing code already in operation. Assuming all is well, the code is promoted to the final Production environment, where Airflow is used to orchestrate and schedule the regular running of batch inferences or data transformations.
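To give a flavour of the orchestration layer, below is a minimal sketch of an Airflow DAG scheduling a batch inference pipeline. The task callables, DAG name and schedule are placeholders rather than the production configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_preprocessing():
    ...  # e.g. trigger the Sagemaker processing job that builds features in Snowflake


def run_batch_inference():
    ...  # e.g. trigger the Sagemaker batch inference job and write results back to Snowflake


with DAG(
    dag_id="financial_vulnerability_batch_inference",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",  # placeholder cadence
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id="preprocess", python_callable=run_preprocessing)
    inference = PythonOperator(task_id="batch_inference", python_callable=run_batch_inference)
    preprocess >> inference
```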
The development of the financial vulnerability detection algorithm, in combination with this new platform, has enabled EDF to proactively identify financially vulnerable customers, which means that as an organisation EDF can provide additional support to safeguard them.
The data science team is now using the first MVP version of this new MLOps platform to deploy not only the financial vulnerability detection algorithm but also many other data science products. The focus is now on developing the platform further by adding capabilities for model monitoring, experiment tracking and a model registry.
These additional features will allow the team to scale the number of models running in production which will also mean a greater ability to use data science to improve the experience for all EDF customers.