An SLA is often defined from a fragmented, technical perspective: what can we deliver on the various technical components at a reasonable cost? End users don't really care about technical specs; they just want their software application to work. Take an email system, for instance. You could have a situation where your end user is complaining that their email isn't working because they see a blank screen. One of your SLA requirements is a guaranteed server uptime of 99.9%, so you start looking for the issue. It turns out that the server is working (the email host is running), but the connection to the database has been lost: hence the blank screen. You are actually compliant with that SLA. But the fact remains that you have an unhappy end user. That is not in line with our way of working.
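The gap between the two viewpoints can be sketched in a few lines. This is a minimal, hypothetical illustration (the function names are invented for this example, not part of any real monitoring tool): a component-level check passes while the end-user check fails.

```python
def server_sla_check(host_up: bool) -> bool:
    # Component-level SLA view: is the email host process running?
    return host_up

def end_user_check(host_up: bool, db_connected: bool) -> bool:
    # End-user view: the mailbox only renders if host AND database respond.
    return host_up and db_connected

# The scenario from the story: host is up, database connection is lost.
print(server_sla_check(True))       # True  -> "SLA compliant"
print(end_user_check(True, False))  # False -> blank screen for the user
```

Both checks look at the same system, but only the second one measures what the user actually experiences.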
Living up to your end users' expectations
At Itility we work from an SRE (Site Reliability Engineering) point of view, where the focus is continuously on end users and their experience. We begin by defining Service Level Objectives (SLOs) together with customer representatives. This is where end users share their expectations of the application. Sticking with the email example, requirements can sound like: I want my email to be available, it should respond with fast loading times, I want it to be reliable and error-free, and I want to see the correct content. Initially, these objectives can be fairly general and highly subjective. What does the end user perceive as "always available", and how quickly should the application open for the user to perceive it as fast?
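One way to make such subjective wishes discussable is to write them down as explicit objective/target pairs. A minimal sketch for the email example (the targets below are invented placeholders, not real customer figures):

```python
# Hypothetical SLO set for the email example; names and targets are
# illustrative, not actual values agreed with a customer.
email_slos = {
    "availability": {"objective": "email is reachable",
                     "target": "99.9% of probes succeed"},
    "latency":      {"objective": "inbox loads fast",
                     "target": "opens within 2 seconds"},
    "reliability":  {"objective": "works without errors",
                     "target": "less than 0.1% failed actions"},
    "correctness":  {"objective": "shows the right content",
                     "target": "0 data-integrity incidents"},
}

print(sorted(email_slos))
# ['availability', 'correctness', 'latency', 'reliability']
```

Writing the targets down, even as rough first guesses, turns "it should feel fast" into something that can be measured and later renegotiated.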
That is why the next step is important: translating the SLOs defined by the key user(s) into Service Level Indicators (SLIs). These SLIs are the component-level metrics we measure while the functionality is being offered to the end user. They provide the guidelines for improvement and fine-tuning, and thus make it possible to realize the objectives. In-depth knowledge of the software application is required for this part: the technical application expert needs to draw up the landscape architecture, allowing us to identify the relevant components to measure with appropriate SLIs. For the email example, this results in a specification that not only requires a server uptime of 99.9%, but also requires the database to be sufficiently available.
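The email example also shows why the database matters even when the server meets its target. A simple sketch (probe counts invented for illustration): an availability SLI is the fraction of successful probes, and for components in series the availabilities multiply, so the end-to-end figure is always lower than either component alone.

```python
def availability_sli(successful_probes: int, total_probes: int) -> float:
    # SLI: the fraction of measurement probes that succeeded.
    return successful_probes / total_probes if total_probes else 0.0

server_avail = availability_sli(999, 1000)  # 0.999 -> the classic "99.9% uptime"
db_avail = availability_sli(995, 1000)      # 0.995

# The user needs server AND database; serial availabilities multiply,
# so the end-to-end number is below both component figures.
end_to_end = server_avail * db_avail
print(round(end_to_end, 6))  # 0.994005
```

This is why the SLI set has to cover every component on the user's path, not just the one named in the SLA.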
Measure your performance
After the Service Level Objectives and Indicators are defined, we start monitoring the application to see whether we meet those requirements. The next step is to act upon the results. That is why we have created a dashboard with all SLIs, providing a clear overview of the application's performance. For easy comparison, all measuring units are translated to a value between 0 and 100% (of course, there is an additional tab with the 'original' measuring units such as seconds, counts, or percentages). Red items are falling short of the requirements, while green ones are meeting their targets.
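A normalization like the one on the dashboard could look roughly as follows. This is a sketch of one possible scheme, not the actual dashboard logic: each raw measurement is scored against its target, capped at 100%, so metrics with different units become comparable at a glance.

```python
def sli_score(value: float, target: float, lower_is_better: bool = True) -> float:
    # Map a raw measurement onto a 0-100% score against its target,
    # capped at 100% so over-performance doesn't skew the overview.
    if lower_is_better:                   # e.g. loading time in seconds
        ratio = target / value if value > 0 else 1.0
    else:                                 # e.g. an availability percentage
        ratio = value / target if target > 0 else 1.0
    return round(min(ratio, 1.0) * 100, 1)

print(sli_score(1.5, 2.0))                           # 100.0 -> green, faster than target
print(sli_score(4.0, 2.0))                           # 50.0  -> red, twice the target time
print(sli_score(99.5, 99.9, lower_is_better=False))  # 99.6  -> just short of target
```

With every SLI on the same 0-100% scale, the red/green dashboard view is just a threshold check per item.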
Left: Report of SLI performances IMIP application – first run. Right: Report of SLI Performances IMIP application – current status
Don't fix what's not broken
The first step is not to start repairing whatever looks broken at first sight on the SLI dashboard. During monthly review meetings with the end users, the results are evaluated; what matters is their perception of how the software application is running. As the team responsible for the Itility Cloud Control platform, we initially agreed with our users that, for instance, a total Puppet run should complete in 120 seconds. It turned out to take 180 seconds. If the end user perceives this waiting time as acceptable, you don't need to put effort into fixing it; you adapt the SLO instead. In this case the Indicator goal was set too tight, so we adjusted the goal instead of improving the Puppet run duration. It is a continuous process of balancing measurement, time spent on repairs (operations), and time spent on improvements and new features (development). The Puppet run example showed that sometimes lowering a goal yields a more satisfied end user, because the DevOps team can spend more time on things that are important to the end user.
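The review decision from the Puppet example boils down to a simple rule. The numbers come from the text; the rule itself is a deliberate simplification of what is really a conversation in the monthly review meeting:

```python
# Sketch of the review decision; a simplification of the monthly review.
slo_target_s = 120      # agreed objective: Puppet run completes within 120 seconds
measured_s = 180        # what the SLI actually showed
users_satisfied = True  # outcome of the review meeting with the end users

if measured_s > slo_target_s and users_satisfied:
    # The goal was set too tight: loosen the objective instead of "fixing" it,
    # and spend the freed-up time on development work the users do care about.
    slo_target_s = measured_s

print(slo_target_s)  # 180
```

Had the users been unhappy with the waiting time, the same miss would instead have triggered repair work on the run duration.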
The essence of SRE is continuous improvement; it is inherently an ongoing process. This means: don't drag out the startup phase because you are afraid of missing something, just get going. Decide on a starting point and start measuring. And when your dashboard shows all green (you met all the required Indicators) but you still have unhappy end users: evaluate and adjust.
Finding the balance between Dev and Ops
It is important to review the results together with the end user and fine-tune your indicators: which Indicators are missing and should be measured, on which Indicators did we miss the goals (and how can we fix that), and should we loosen any Indicator goals? This is an interactive process in which you continuously look for balance: meeting the goals (keeping your dashboard items green) without setting those goals so tight that you only have time to fix issues instead of developing the application further. The ultimate target is to give the end user a positive experience with the application, from both a functional and a non-functional perspective.
Also read this elaborate explanation about SRE and how it relates to DevOps.