Site Reliability Engineering

Developing a data-driven tool to estimate the cost of incidents

Being data-driven is one of our core values at HelloFresh. We are proud of making decisions based on strong evidence rather than gut feelings. This of course includes how we assess the impact of incidents affecting our services. But what counts as an incident at HelloFresh?

We consider anything that has an impact on our business to be an incident. Imagine, for example, that the service handling the meal selection process is down for one hour and customers using our app cannot change their subscription plans. That is what we would consider an incident. As you might expect, how we deal with incidents has an important impact on HelloFresh’s success. That’s why we have a detailed Incident Management Process describing what to do throughout the life of an incident. This process ends with an analysis in which our Site Reliability Engineering (SRE) engineers analyse the root cause of the incident with the objective of:

Preventing future incidents
Recording efficient ways of quickly mitigating or solving similar incidents
Assessing the impact of the incident

This last point has historically been difficult to achieve for a number of reasons. In Product Analytics we wanted to support our colleagues in SRE, so we created a model to predict the monetary cost of incidents affecting our conversion funnel.

When one of the services that make up this funnel is down (say there is a problem in the service processing the payment methods of our prospective customers), we cannot convert visitors into customers. This kind of downtime impacts our business.

The model is based on historical conversion time series data and, as a first step, predicts the number of conversions expected under normal operation. If an incident occurs, we can simply compare the predicted conversions with the actual ones over the duration of the incident and calculate the number of lost conversions and the associated lost revenue.
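The comparison described above is simple arithmetic. The sketch below illustrates it under a few assumptions: the function name `estimate_loss`, the flat `revenue_per_conversion` figure, and the choice to ignore hours where actuals exceed the prediction are all illustrative, not details of the actual model.

```python
def estimate_loss(predicted, actual, revenue_per_conversion):
    """Estimate lost conversions and lost revenue over an incident window.

    predicted / actual: conversions per time bucket (e.g. per hour) during the incident.
    revenue_per_conversion: hypothetical flat revenue value per conversion.
    """
    # Assumption: buckets where actuals beat the prediction count as zero loss.
    lost = sum(max(p - a, 0.0) for p, a in zip(predicted, actual))
    return lost, lost * revenue_per_conversion


# Made-up numbers for a three-hour incident.
predicted = [120.0, 130.0, 125.0]
actual = [30, 10, 95]
lost, lost_revenue = estimate_loss(predicted, actual, revenue_per_conversion=40.0)
print(lost, lost_revenue)  # 240.0 9600.0
```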

When our Product Analytics Team started to think about how we could achieve such a model, we established the following goals:

Simplicity: the model should ideally be based only on a smart implementation of basic statistics
Data Product: at HelloFresh we are product-oriented, and that includes our data. We wanted to create a tool that our engineers use daily
Flexibility: we wanted the model to require minimal user input and therefore be able to automatically detect incident periods for all our markets
Accuracy: we already had some tools doing similar things, but they were not accurate enough

Previous Approach
As I just mentioned, we already had a model that was used to predict customer conversions on a given date. This approach averaged the conversions of the previous four data points (the four previous weeks, in this case) to estimate the conversions on the target date.

The model was very simple and easy to implement as a data product, but too inaccurate. We were using only four data points to derive an estimate, which led to very wide 95% confidence intervals.
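To see why four points give such wide intervals, here is a minimal sketch of a four-point average with a 95% confidence interval for the mean. It assumes a Student's t interval, which may differ from how the original intervals were computed. The function name and the sample values are made up.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 3 degrees of freedom (n = 4).
T_CRIT_95_DF3 = 3.182


def four_point_estimate(conversions):
    """Average four past observations and return (estimate, (lower, upper))."""
    if len(conversions) != 4:
        raise ValueError("this sketch expects exactly four data points")
    estimate = mean(conversions)
    # Half-width of the t interval: t * s / sqrt(n)
    half_width = T_CRIT_95_DF3 * stdev(conversions) / sqrt(len(conversions))
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical conversions on the same weekday over the previous four weeks.
estimate, (lower, upper) = four_point_estimate([100, 140, 90, 150])
print(estimate, round(lower, 1), round(upper, 1))
```

With so few points the critical value alone is above 3, so even moderate week-to-week variation produces an interval spanning a large fraction of the estimate.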

Previous Approach Predicted / Actual Conversions — 4 Data Points
An intuitive way to try to improve the accuracy of this approach was to include more data points. Unfortunately, this did not help: the model error increased dramatically.

Previous Approach Predicted / Actual Conversions — 32 Data Points
This actually makes sense, as our historical conversion data is highly seasonal and contains a significant number of anomalies: days on which conversions are unusually high due to, for example, a strong marketing campaign.

We therefore needed to approach the problem a bit differently.

Improved Model
The improved model we developed leverages one fundamental fact: although conversions historically show very strong week-to-week seasonality, the way conversions are distributed within the day is fairly constant from week to week. This sounds much more complicated than it is. Let’s take a look at an example.

Intraday Relative conversions for four consecutive data points
As we can see, the intraday distribution of conversions on a given day of the week is fairly constant.

This fact can be used to build a simple and fairly accurate model. In addition, using relative values allows us to use the incident day’s actual conversions in our prediction and so account for anomalies. Furthermore, we can now use larger samples of past data, which increases confidence in the prediction and therefore reduces the width of the 95% confidence intervals.
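One way to read this idea in code is sketched below. It is an interpretation, not the actual implementation: the function names are hypothetical, the day is split into four buckets instead of 24 hours to keep the numbers readable, and the day's volume is inferred from the buckets outside the incident.

```python
def intraday_shares(history):
    """Average share of the daily total that falls into each time bucket.

    history: past days for the same weekday, each a list of conversions per bucket.
    Assumes every past day has a non-zero total.
    """
    n_buckets = len(history[0])
    shares = [0.0] * n_buckets
    for day in history:
        total = sum(day)
        for bucket, count in enumerate(day):
            shares[bucket] += count / total / len(history)
    return shares


def expected_conversions(shares, actual, incident_buckets):
    """Scale the relative profile to the incident day's own volume."""
    normal = [b for b in range(len(actual)) if b not in incident_buckets]
    # The unaffected buckets tell us how big the day would have been overall.
    day_total = sum(actual[b] for b in normal) / sum(shares[b] for b in normal)
    return [share * day_total for share in shares]


# Two past days with very different volumes but the same intraday shape.
history = [[10, 20, 30, 40], [20, 40, 60, 80]]
shares = intraday_shares(history)

# Incident day: unusually high volume overall, with bucket 2 affected.
actual = [30, 60, 15, 120]
expected = expected_conversions(shares, actual, incident_buckets={2})
lost = expected[2] - actual[2]
print([round(e) for e in expected], round(lost))
```

Because the profile is relative, the unusually busy incident day is handled naturally: the expected values scale with that day's own volume and are not pulled toward the quieter past weeks.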

The following snapshot shows a flow diagram of the steps carried out by the model to estimate the lost conversions on a dummy incident.

Model Flow Diagram
And in the picture below we have the actual conversions versus the model estimates, showing a close match between the estimates and actuals.

Previous Approach Predicted / Actual Conversions — Improved Model

Model Implementation and real usage examples
The model was developed on a Databricks Cloud notebook using Python (PySpark, pandas). Additionally, the model is available as a Python package that can be run locally.

It requires only two mandatory inputs from the user: the incident day and the affected country. On a day of normal operation, the output of the model looks like this.

Model Output — No Incident Detected
The model did not detect an incident and shows a time plot of the actual and expected conversions, which closely match each other. If, on the other hand, we pass the model a date on which an incident occurred, we get something similar to this.

Model Output — Incident Detected
The model detects the incident’s start and end time and provides the number of lost conversions and the associated lost net revenue. On the conversion time plot, the incident period is shaded red.
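The article does not spell out how the start and end times are found. As a purely illustrative sketch, one simple rule is to flag the longest contiguous run of time buckets in which actual conversions fall below a fraction of the expected ones. The function name, the 50% threshold, and the data are assumptions.

```python
def detect_incident(expected, actual, threshold=0.5):
    """Return (start, end) of the longest run with actual < threshold * expected.

    Indices refer to time buckets; end is exclusive. Returns None if no bucket
    falls below the threshold. Assumes both lists have the same length.
    """
    best = None
    start = None
    for i, (e, a) in enumerate(zip(expected, actual)):
        below = a < threshold * e
        if below and start is None:
            start = i  # a new run begins
        elif not below and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    # Close a run that lasts until the end of the day.
    if start is not None and (best is None or len(actual) - start > best[1] - best[0]):
        best = (start, len(actual))
    return best


expected = [30, 60, 90, 120, 100]
print(detect_incident(expected, [29, 61, 20, 30, 98]))   # (2, 4)
print(detect_incident(expected, [29, 61, 88, 119, 98]))  # None
```

A production version would more likely compare actuals against the lower bound of the prediction's confidence interval than use a fixed fraction.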

This short article showcased one of the projects we have been working on lately at HelloFresh Product Analytics. Our mission is to create actionable insights to support a product strategy that puts the customer’s needs at the heart of all decision-making. This mission can take very different forms.

In this case we created a model to estimate the cost of downtime caused by incidents. As mentioned previously, it can be used by SRE engineers during root cause analyses. But it could also be used by Product Owners to prioritize projects aimed at reducing downtime, or as an alerting system to quickly detect and mitigate incidents, improving the user experience.
