Predicting software build duration – a regression problem
When you run operations in a company, your main goal is to ensure everything happens smoothly and in a timely manner. Designing solutions that optimize processes and make the best use of employees’ time is therefore of high business value. But to build such a solution, one must first understand the variables and parameters that contribute to the problem. As you can imagine, this is where having a data scientist on board becomes very handy.
One of Itility’s partners is a company supplying photolithography systems commonly used for manufacturing chips. Their software engineers work with a large codebase on a daily basis to deliver the best software solutions to their clients. Because all of the available code lives in a central repository, compilation and build processes can take a substantial amount of time (up to several hours).
The goal of our hackathon was to optimize the software build duration by applying data-driven solutions. The participants were given a dataset that included a number of variables related to building new software, recorded from May to August. However, only the May data included information about the number of winkins and translations. So the first challenge was to predict the numbers of winkins and translations for the other months (June to August).
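A minimal sketch of that first step, assuming a scikit-learn workflow and hypothetical column names (the real field names weren’t published): fit a regressor per missing target on the May rows, then fill in the later months.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# All file and column names below are hypothetical stand-ins for the real data.
builds = pd.read_csv("builds.csv", parse_dates=["timestamp"])

may = builds[builds["timestamp"].dt.month == 5]        # winkins/translations known
later = builds["timestamp"].dt.month.isin([6, 7, 8])   # mask for the gap months

features = ["files_changed", "dependency_count", "queue_length"]  # hypothetical

for target in ["winkins", "translations"]:
    model = GradientBoostingRegressor(random_state=0)
    model.fit(may[features], may[target])
    builds.loc[later, target] = model.predict(builds.loc[later, features])
```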
Once those values were calculated, they could be added to the original dataset and used as training data for the main challenge. And there was a twist! If a group didn’t manage to predict those values themselves within the given time, they could continue to work on the main challenge without them, or they could buy them from me at the cost of +0.2 root mean squared error (RMSE) added to their final score. A decision worthy of a good strategist!
So what was the main hackathon challenge? Participants were to predict (you guessed it…) the duration of the builds in August, based on the provided variables from June and July. That means they had to identify the key predictors of build duration within the dataset and base their model on those parameters. The final score would be calculated as the RMSE plus the score from our famous applause meter.
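For reference, RMSE is simply the square root of the mean squared difference between predicted and actual values. A minimal implementation:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Durations in hours: residuals of -0.5, 0.5 and 1.0 give an RMSE of ~0.71.
print(rmse([2.0, 3.5, 6.0], [2.5, 3.0, 5.0]))
```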
It was time to get started!
The room split into small groups, all set out to win. Time is always of the essence in our challenges, so many of the groups decided to take the easy road: focus on the numerical variables in the dataset and save precious time on re-encoding the data. Those who decided to take advantage of the entire dataset first had to map the non-numerical variables to integers. On the one hand this means more time spent on data preparation, on the other hand more variables of potential significance. Notably, even the groups who focused on numerical variables had to deal with data cleaning and account for NAs and other missing values.
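Both steps take only a few lines in pandas. The frame and column names below are hypothetical stand-ins for the hackathon data:

```python
import pandas as pd

# Hypothetical frame standing in for the hackathon dataset.
df = pd.DataFrame({
    "duration":   [1.2, 3.4, None, 2.1],
    "build_type": ["full", "incremental", "full", None],
    "host":       ["a01", "a02", "a01", "a03"],
})

# Numeric columns: fill NAs (here with the median) instead of dropping rows.
df["duration"] = df["duration"].fillna(df["duration"].median())

# Non-numeric columns: map each category to an integer code (NaN becomes -1),
# so models that only accept numbers can still use these variables.
for col in df.select_dtypes(exclude="number"):
    df[col] = pd.factorize(df[col])[0]
```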
Another assumption teams decided to make was the linearity of the data. As the title says, this is a regression challenge, and a linear model is the easiest one to start with. As many participants explained, even if a linear model is not perfect, it is easier (and faster – remember, the clock is ticking!) to tweak and adjust than to explore other regression models.
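Such a baseline takes only a few lines with scikit-learn. Since the hackathon dataset isn’t public, the sketch below uses synthetic data in its place:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: four numeric predictors, one "duration" target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline = LinearRegression().fit(X_train, y_train)

pred = baseline.predict(X_test)
print("baseline RMSE:", np.sqrt(np.mean((y_test - pred) ** 2)))
print("coefficients:", baseline.coef_)  # large magnitudes hint at key predictors
```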
Time passed very quickly, and teams had to consider whether they were going to buy the translation and winkins predictions to get ahead of the others. Interestingly, none of the teams chose to do so: they either relied on their own predictions or used only the initial dataset for their final model.
But was it worth it?
In the end, the winning solution used only the numeric values from the dataset in combination with XGBoost. To reduce the influence of outliers, the team applied a logarithmic transformation to scale down the values of the target variable before training. Once the model was trained, the predictions were scaled back to the original range. Together, this resulted in an impressive RMSE of 0.8.
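The team’s exact code wasn’t shared, but the idea looks roughly like this sketch, with synthetic, right-skewed data standing in for the real build durations:

```python
import numpy as np
import xgboost as xgb

# Synthetic right-skewed "durations": a few builds run far longer than the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.exp(0.8 * X[:, 0] + rng.normal(scale=0.3, size=1000))

X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

# Log-transform the target so outlier builds pull less on the loss...
model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, np.log1p(y_train))

# ...then invert the transform to score on the original duration scale.
pred = np.expm1(model.predict(X_test))
print("RMSE:", np.sqrt(np.mean((y_test - pred) ** 2)))
```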
There were a couple of ideas circulating in the room that, even though they didn’t result in high scores, were very interesting and with additional time would have been close contenders. One such solution involved a random forest. The main benefit of this method is that, in implementations that support it, it can incorporate non-numerical values directly, without translating them into numbers first.
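Which implementation that team used wasn’t mentioned. As one concrete Python option, LightGBM’s random-forest mode accepts pandas category columns natively (scikit-learn’s RandomForestRegressor, by contrast, still requires numeric input); a toy sketch:

```python
import pandas as pd
import lightgbm as lgb

# Toy frame: "build_type" and "host" stay categorical, no integer mapping needed.
df = pd.DataFrame({
    "files_changed": [12, 340, 5, 87, 41, 230],
    "build_type": pd.Categorical(["full", "incr", "full", "full", "incr", "full"]),
    "host": pd.Categorical(["a01", "a02", "a01", "a03", "a02", "a03"]),
    "duration": [3.1, 0.7, 2.8, 3.5, 0.9, 3.0],
})
X, y = df.drop(columns="duration"), df["duration"]

# boosting_type="rf" turns LightGBM into a bagged random forest; category-dtype
# columns are consumed as-is, without manual encoding.
model = lgb.LGBMRegressor(
    boosting_type="rf",
    n_estimators=100,
    subsample=0.8,        # rf mode requires row bagging...
    subsample_freq=1,     # ...on every iteration
    min_child_samples=2,  # relaxed only because this toy dataset is tiny
)
model.fit(X, y)
print(model.predict(X))
```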
The overall conclusion of this hackathon was that the most important step in any of these challenges is to understand the data. If you know what you are dealing with, you are on the right track!
Want to hack along during the next Meetup? Sign up for the upcoming hackathon.