SRE and efficient monitoring – getting the most out of your DevOps team
When you develop a software application, it is essential that the end user has a positive experience using it. The iterative process of CI/CD (Continuous Integration and Continuous Deployment) is used to continuously balance two goals: keeping the product reliable enough that end users are happy with it, and releasing new functionality at the same time. It marks a transition from the traditional, statically defined Service Level Agreement (SLA) approach toward a way of working that focuses on the needs and experiences of the end user, both at the start and throughout the lifetime of the product.
An SLA is often defined from a fragmented technical perspective: what can we deliver on the various technical components at a reasonable cost? End users don’t really care about the technical specs; they just want their software application to work. Take an email system, for instance. Suppose your end user complains that their email isn’t working because they see a blank screen. One of your SLA requirements is a guaranteed server uptime of 99.9%, so you start looking for the issue. It turns out that the server is working (the email host is running), but the connection to the database was lost. Hence the blank screen. You are technically compliant with the SLA, yet you have an unhappy end user. That is not in line with our way of working.
Living up to your end users' expectations
At Itility we work from an SRE (Site Reliability Engineering) point of view, where the focus is continuously on the end user and their experience. We begin by defining Service Level Objectives (SLOs) together with customer representatives. This is where the end user shares their expectations of the application. Staying with the email example, the requirements could be: I want my email to be available, it should respond with fast loading times, I want it to be reliable and free of errors, and I want to see the correct content. Initially, these objectives can be fairly general and highly subjective. What does the end user perceive as always available, and how long may opening the application take for the user to still perceive it as fast?
That is why the next step is important: translating the SLOs defined by the key user(s) into SLIs. These Service Level Indicators (SLIs) are the component-level metrics that we measure while the functionality is being offered to the end user. They provide the guidance for improvement and fine-tuning that lets us realize the objectives. This part requires in-depth knowledge of the software application. The technical application expert needs to draw up the landscape architecture so that we can identify the relevant components and measure them with appropriate SLIs. For the email example, this results in a specification that says not only that the server should have an uptime of 99.9%, but also that the database should be sufficiently available.
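To make the translation from an objective to its indicators concrete, here is a minimal Python sketch of the email example. It is an illustration only, not our actual tooling: the class names `SLI` and `SLO`, the `is_met` helper, the measured values, and the 99.9% database target are all assumptions (the text above only asks for the database to be "sufficiently available").

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SLI:
    """One component-level metric that backs an objective (hypothetical model)."""
    component: str
    metric: str
    target: float  # minimum acceptable value, in percent


@dataclass(frozen=True)
class SLO:
    """An end-user objective plus the indicators that back it (hypothetical model)."""
    description: str
    indicators: Tuple[SLI, ...]

    def is_met(self, measured: Dict[Tuple[str, str], float]) -> bool:
        # The objective holds only if every backing indicator meets its target.
        return all(
            measured[(sli.component, sli.metric)] >= sli.target
            for sli in self.indicators
        )


email_available = SLO(
    description="I want my email to be available",
    indicators=(
        SLI("email server", "uptime", 99.9),
        # Assumed target: the article does not give a number for the database.
        SLI("database", "availability", 99.9),
    ),
)

# Invented measurements: the server meets its target, the database does not.
measured = {
    ("email server", "uptime"): 99.95,
    ("database", "availability"): 98.0,
}

print(email_available.is_met(measured))  # False
```

With only the server indicator, this situation would have looked fine; adding the database indicator is what makes the blank-screen case visible.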
Measure your performance
After the Service Level Objectives and Indicators are defined, we start monitoring the application to see whether we meet those requirements. The next step is to act upon the results. That is why we have created a dashboard with all SLIs, which gives us a clear overview of the application's performance. To make the indicators comparable, all measuring units are translated to a value between 0 and 100% (an additional tab shows the ‘original’ measuring unit, such as seconds, counts, or percentages). Red items fall short of the requirements; green items meet their targets.
Left: Report of SLI performances IMIP application – first run. Right: Report of SLI Performances IMIP application – current status
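The translation of every indicator to a value between 0 and 100% can be done in several ways, and the dashboard's exact formula is not described here. As one possible sketch, assuming the score is simply "percentage of the target achieved, capped at 100", it could look like this in Python. The function names `score` and `colour` and the `higher_is_better` flag are hypothetical.

```python
def score(measured: float, target: float, higher_is_better: bool = True) -> float:
    """Return how much of the target is achieved, as a value from 0 to 100.

    Assumed formula, for illustration only:
    - higher-is-better metrics (e.g. uptime %): measured / target
    - lower-is-better metrics (e.g. duration in seconds): target / measured
    """
    if higher_is_better:
        ratio = measured / target
    else:
        # A non-positive duration cannot miss a "lower is better" target.
        ratio = target / measured if measured > 0 else 1.0
    return max(0.0, min(100.0, ratio * 100.0))


def colour(value: float) -> str:
    """Green when the target is fully met, red otherwise."""
    return "green" if value >= 100.0 else "red"


# Uptime of 99.95% against a 99.9% target: target met.
uptime_score = score(99.95, 99.9)
print(uptime_score, colour(uptime_score))

# A run of 180 seconds against a 120-second target: roughly two thirds of the goal.
duration_score = score(180, 120, higher_is_better=False)
print(round(duration_score, 1), colour(duration_score))
```

Capping at 100 keeps an indicator that overshoots its target from masking others when scores are compared side by side; the original units remain available for anyone who needs the raw numbers.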
Don't fix what's not broken
When you look at the SLI dashboard, the first step is not to start repairing whatever seems broken at first sight. The results are evaluated with the end users during monthly review meetings, and what matters is their perception of how the software application is running. Being responsible for the Itility Cloud Control platform team, we initially agreed with our users that, for instance, the total Puppet run should complete within 120 seconds. In practice it took 180 seconds. If the end user perceives this waiting time as acceptable, you don’t need to put effort into fixing it; you adapt the SLO instead. In this case the indicator goal was set too tight, so we adjusted the goal instead of shortening the Puppet run. It is a continuous process of finding the balance between measuring, spending time on repairs (operations), and spending time on improvements and new features (development). The Puppet run example shows that lowering a goal sometimes yields a more satisfied end user, because the DevOps team can spend more time on the things that matter to that user.
The essence of SRE is continuous improvement, which makes it inherently an ongoing process. So don’t drag out the startup phase because you think you might miss something; just get on with it. Decide on a starting point and start measuring. If your dashboard shows all green (you met all the required indicators) but you still have unhappy end users, evaluate and adjust.
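The decision made in the review meeting can be written down as a small rule: an indicator that misses its goal is either relaxed (when the users accept the measured value) or put on the list of things to fix. The Python sketch below illustrates that rule under stated assumptions: all indicators are durations where lower is better, the function name `review` is hypothetical, and `mail_load_seconds` with its numbers is an invented second indicator added only for contrast.

```python
from typing import Dict, List, Set, Tuple


def review(
    indicators: Dict[str, Tuple[float, float]],
    accepted_by_users: Set[str],
) -> Tuple[Dict[str, float], List[str]]:
    """Split missed indicators into relaxed goals and items to fix.

    indicators maps a name to (goal, measured), both in seconds.
    accepted_by_users holds the names whose measured value the users
    said they find acceptable.
    """
    relaxed: Dict[str, float] = {}
    to_fix: List[str] = []
    for name, (goal, measured) in indicators.items():
        if measured <= goal:
            continue  # goal met, nothing to decide
        if name in accepted_by_users:
            relaxed[name] = measured  # adopt the measured value as the new goal
        else:
            to_fix.append(name)
    return relaxed, to_fix


indicators = {
    "puppet_run_seconds": (120, 180),  # from the example above
    "mail_load_seconds": (2, 5),       # invented for illustration
}
relaxed, to_fix = review(indicators, accepted_by_users={"puppet_run_seconds"})
print(relaxed)  # {'puppet_run_seconds': 180}
print(to_fix)   # ['mail_load_seconds']
```

The Puppet goal moves from 120 to 180 seconds without any engineering effort, which leaves the team free to work on the indicator the users actually care about.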
Finding the balance between Dev and Ops
It is important to review the results together with the end user and to start fine-tuning your indicators. Which indicators should be measured that aren’t there yet? Are there indicators where we didn’t meet the goals, and how can we fix that? Should we loosen any indicator goals? This is an iterative process in which you continuously look for the balance between meeting the goals (keeping your dashboard items green) and not setting those goals so tight that you only have time to fix issues with the application instead of developing it further. The ultimate aim is to give the end user a positive experience of the application, from both a functional and a non-functional perspective.
Also read this elaborate explanation about SRE and how it relates to DevOps.