Decision making in IT stacks
Below are 3 nicely formatted storage events:
| Event 1 | Event 2 | Event 3 |
|---|---|---|
| Index = storage | Index = storage | Index = storage |
| Time = 2019-08-05 00:00:00 | Time = 2019-08-05 00:10:00 | Time = 2019-08-05 00:20:00 |
| StorageServerName = EU_1 | StorageServerName = EU_1 | StorageServerName = EU_2 |
| CapacityUsedGB = 550 | CapacityUsedGB = 600 | CapacityUsedGB = 990 |
| CapacityTotalGB = 1000 | CapacityTotalGB = 1000 | CapacityTotalGB = 1000 |
It should be clear from these events what the fields mean. For example, Event 3 is about a storage server called EU_2, which at 20 past midnight had only 10 GB of free space left (1000 GB total minus 990 GB used). We can now give meaning to these events, just as we give meaning to sentences made of words in a particular order.
What more is there to collecting data? Be careful when ingesting data: besides creating nice events, you need to think about other factors as well. For example, how do you handle duplicate or missing data? Is it easy to combine this data with other types of data? Have you thought about cleaning data (for example, transforming “---$X$ 5 AUG 19 10.00 $X$---” into “Time = 2019-08-05 10:00:00”)?
Also think about error handling, licensing, security, documentation, and so on.
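To make the cleaning question concrete, here is a minimal sketch in Python of the transformation mentioned above. The function name `clean_timestamp`, the regular expression, and the assumption that a two-digit year means 20xx are all illustrative choices based on the single example string, not a description of any real ingestion pipeline.

```python
import re
from datetime import datetime

# Month abbreviations as they appear in the example ("AUG").
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}

# Assumed raw layout: day, month abbreviation, two-digit year, "HH.MM",
# surrounded by arbitrary noise such as "---$X$".
RAW_PATTERN = re.compile(r"(\d{1,2}) ([A-Za-z]{3}) (\d{2}) (\d{1,2})\.(\d{2})")


def clean_timestamp(raw: str) -> str:
    """Extract the date and time from a noisy string and normalise it."""
    match = RAW_PATTERN.search(raw)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {raw!r}")
    day, month, year, hour, minute = match.groups()
    if month.upper() not in MONTHS:
        raise ValueError(f"unknown month in: {raw!r}")
    # Assumption: a two-digit year always refers to the 2000s.
    parsed = datetime(2000 + int(year), MONTHS[month.upper()],
                      int(day), int(hour), int(minute))
    return f"Time = {parsed:%Y-%m-%d %H:%M:%S}"


print(clean_timestamp("---$X$ 5 AUG 19 10.00 $X$---"))
# Time = 2019-08-05 10:00:00
```

Raising an error on anything unrecognised, instead of guessing, keeps bad timestamps out of the data lake and ties in with the error handling point above.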
As you can see, collecting data into a data lake is no easy task. However, by iterating on it step by step you will end up with a mature product, just like raising a child until it reaches adulthood.
2. Data usage
Back to the baby example: once it is able to make sense of the world, you can interact with it. Or, to put it in business language: you now have a well-structured data lake and are ready to discover insights and deliver value, for example by creating a prediction for capacity management.
First of all, you need to check whether you have all the information you need. For example, from the compute data you can see how many physical hosts there are, where they are located and how much CPU and memory they consume, but you can’t see which virtual machines run on them. One possible cause is that you forgot to collect this data, in which case you should go back to the previous step, just like children who sometimes have to repeat a class because they forgot to study. Another is that the sources of the collected data aren’t linked yet. In practice, linking them is often a difficult step, as registration systems are not always completely up to date, have missing fields or follow a different structure. Once you have figured out how to combine all the data, you are ready to generate value and insights from it.
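As a rough illustration of the linking step, the sketch below joins host records to virtual machine records and reports what could not be matched. The records, field names and the function `link_vms_to_hosts` are all hypothetical. The only normalisation applied is trimming whitespace and lowercasing the host name, and real registration systems usually need more than that.

```python
# Hypothetical records: hosts from the compute data, VMs from a registration system.
hosts = [
    {"host": "host-01", "location": "EU", "cpu_pct": 70},
    {"host": "host-02", "location": "EU", "cpu_pct": 35},
]
vms = [
    {"vm": "vm-a", "host": "host-01"},
    {"vm": "vm-b", "host": " HOST-01"},  # same host, different spelling
    {"vm": "vm-c", "host": "host-03"},   # refers to a host that was never collected
]


def link_vms_to_hosts(hosts, vms):
    """Return (vms_per_host, unmatched_vms), matching on a normalised host name."""
    vms_per_host = {h["host"].strip().lower(): [] for h in hosts}
    unmatched = []
    for vm in vms:
        key = vm["host"].strip().lower()
        if key in vms_per_host:
            vms_per_host[key].append(vm["vm"])
        else:
            unmatched.append(vm["vm"])
    return vms_per_host, unmatched


vms_per_host, unmatched = link_vms_to_hosts(hosts, vms)
print(vms_per_host)  # {'host-01': ['vm-a', 'vm-b'], 'host-02': []}
print(unmatched)     # ['vm-c']
```

An unmatched VM or a host without any VMs is not necessarily an error, but both are good prompts to check whether data is missing or a source is out of date.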
Usually, you start by creating a dashboard. This is useful both for validating the data and for broadening your understanding of it. Together with your customer, you think about which visuals might help to spot capacity risks more quickly. It might take a while to get this right, as it is often hard to define in advance which results will help most.
When dashboards are in place, it is time to advance to the next phase, like a child going from primary school to secondary school and then possibly university. In business, this means answering more difficult but important questions to obtain more valuable results. For capacity planning, it is important to order new equipment before reaching unstable levels. Because we need to prevent a bad situation from happening, a scheduled report or alert works much better than a static dashboard: if nobody looks at the dashboard, the bad situation can be missed completely, whereas an alert fires as soon as its condition is met. To determine when to order new equipment, we need to make calculations based on supply and demand. The challenge is that supply and demand data are often incomplete and very volatile, which means we need smart algorithms to find a practical solution. In addition, the interdependencies across the entire stack make it quite hard to alert correctly and in time on when and what to order.
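The simplest form of such an alert is a threshold check on the latest event per server. The sketch below reuses the three storage events from the start of this post. The 90% threshold and the function name `capacity_alerts` are assumptions for illustration only, and a real alert would also have to deal with the incomplete and volatile data described above.

```python
events = [
    {"Time": "2019-08-05 00:00:00", "StorageServerName": "EU_1",
     "CapacityUsedGB": 550, "CapacityTotalGB": 1000},
    {"Time": "2019-08-05 00:10:00", "StorageServerName": "EU_1",
     "CapacityUsedGB": 600, "CapacityTotalGB": 1000},
    {"Time": "2019-08-05 00:20:00", "StorageServerName": "EU_2",
     "CapacityUsedGB": 990, "CapacityTotalGB": 1000},
]

ALERT_THRESHOLD = 0.90  # assumed value; tune per environment


def capacity_alerts(events, threshold=ALERT_THRESHOLD):
    """Return one alert message per server whose latest event exceeds the threshold."""
    latest = {}
    for event in events:
        name = event["StorageServerName"]
        # Timestamps in "YYYY-MM-DD HH:MM:SS" form sort correctly as strings.
        if name not in latest or event["Time"] > latest[name]["Time"]:
            latest[name] = event
    alerts = []
    for name, event in sorted(latest.items()):
        used_fraction = event["CapacityUsedGB"] / event["CapacityTotalGB"]
        if used_fraction >= threshold:
            alerts.append(f"{name}: {used_fraction:.0%} used at {event['Time']}")
    return alerts


print(capacity_alerts(events))
# ['EU_2: 99% used at 2019-08-05 00:20:00']
```

Run on a schedule, a check like this reports EU_2 without anyone having to open a dashboard.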
Alerting already adds value, but there is still more value to be gained. Capacity management should be planned rather than treated as incident prevention, and that makes predictions of the future necessary: not only predictions of when to order new hardware, but also predictions that support dividing load across systems, managing spare parts, indicating where there is still free space, and redistributing clusters to save cost. This helps to prevent problems before they arise and at the same time saves money through a smarter allocation of resources, just like young adults balance their time in order to stay healthy, complete their studies, go out with friends and have fun.
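As a first idea of what such a prediction can look like, the sketch below extrapolates the growth between the first and last measurement in a straight line to estimate when a server will be full. This is deliberately the simplest possible model and only an illustration: the function name `predict_full` is made up, and real usage patterns are rarely linear, so a production forecast would need more history and a better model.

```python
from datetime import datetime, timedelta


def predict_full(samples, capacity_total_gb):
    """Estimate when capacity runs out by linear extrapolation.

    samples: list of (datetime, used_gb) tuples, sorted by time.
    Returns a datetime, or None if usage is not growing or there is too little data.
    """
    if len(samples) < 2:
        return None
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    elapsed_s = (t1 - t0).total_seconds()
    growth_gb = used1 - used0
    if elapsed_s <= 0 or growth_gb <= 0:
        return None
    remaining_gb = capacity_total_gb - used1
    # Seconds until full at the observed growth rate.
    seconds_left = remaining_gb * elapsed_s / growth_gb
    return t1 + timedelta(seconds=seconds_left)


# The two EU_1 events from the start of this post: 550 GB and 600 GB used.
eu_1 = [
    (datetime(2019, 8, 5, 0, 0), 550),
    (datetime(2019, 8, 5, 0, 10), 600),
]
print(predict_full(eu_1, 1000))  # 2019-08-05 01:30:00
```

With 50 GB of growth in 10 minutes and 400 GB left, the straight line reaches 1000 GB after another 80 minutes. Two data points are far too few to trust that number, which is exactly why the quality and completeness of the collected data matter so much.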
The final step is to create some form of automated capacity planning. Long before there is a capacity risk, the system should automatically check whether spare parts are available that can support a predicted new load, or whether rebalancing might solve the issue. If neither is the case, a new order is created automatically and an alert email is sent to an engineer, including a report for the engineer to approve or decline. Humans should never be phased out completely, as capacity management mistakes can cost large sums of money. Despite their usefulness in the modern world, always be careful with fully automated algorithms.
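The decision flow described above can be sketched as a small function. Everything here is hypothetical: the `Plan` type, the function `plan_capacity`, its parameters and the idea of expressing spare parts and rebalancing gains in GB are simplifications chosen for illustration. The one design choice taken directly from the text is that an order always requires human approval.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    action: str           # "none", "use_spare", "rebalance" or "order"
    needs_approval: bool  # True when an engineer must approve or decline


def plan_capacity(predicted_load_gb, free_gb, spare_gb, rebalance_gain_gb):
    """Pick the cheapest action that covers the predicted shortfall."""
    shortfall_gb = predicted_load_gb - free_gb
    if shortfall_gb <= 0:
        return Plan("none", False)       # enough free space already
    if spare_gb >= shortfall_gb:
        return Plan("use_spare", False)  # spare parts can absorb the load
    if rebalance_gain_gb >= shortfall_gb:
        return Plan("rebalance", False)  # moving load around frees enough space
    # Nothing else helps: prepare an order, but keep a human in the loop.
    return Plan("order", True)


print(plan_capacity(predicted_load_gb=500, free_gb=100, spare_gb=200, rebalance_gain_gb=150))
# Plan(action='order', needs_approval=True)
```

Keeping the approval flag in the result, rather than placing the order directly, is what leaves the final decision with the engineer.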
I hope this blog explained how data scientists transform data into value, using the IT stack as an example. But as you might have noticed from the introduction, the most valuable thing that can ever be produced from all kinds of different data is you!