More and more automated machine learning (AutoML) toolboxes are available nowadays. These tools help you search for the best machine learning algorithm for your specific data by generating and testing many models for you. This sounds great in theory, but can these tools replace a data scientist?
The answer is a definite no. Firstly, modeling is just one of several essential steps in creating value out of data. A data science project also involves formulating the problem from a business perspective, gathering the required data, processing that data (cleaning, calculating, enriching), and labeling it. Only with that data set at hand can you build a model and tune it toward good performance (many iterations…).
Secondly, a model by itself cannot take any actions. For this, you need a data scientist who can serve the outcome of the model in an alert, a dashboard, or an app, for example. And if you do find value after all those steps, you have to make sure your solution can be used on a day-to-day basis (deploying it in production, monitoring the data processing and the model performance). So no, an AutoML tool is no replacement for a data scientist.
However, AutoML can definitely be a nice aid in the modeling step. Modeling takes around 15-25% of the time of a typical data science project, and using AutoML can decrease your ‘time to value’, especially in quick prototyping activities. AutoML can be an asset when used to handle the pre-processing (how to deal with imbalanced data, how to fill missing values, what to do with outliers), the feature generation and feature selection, the model selection (linear models, K-Nearest Neighbors, Gradient Boosting, neural nets, etc.), and the tuning of model hyperparameters (for example, the number of trees, or the learning rate and number of epochs).
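At its core, the hyperparameter tuning that AutoML automates boils down to searching a parameter space and keeping the best-scoring configuration. A minimal sketch in plain Python of random search over two of the hyperparameters mentioned above (the score function is a toy stand-in for training and validating a real model; all names here are ours, not from any specific tool):

```python
import random

def validation_score(n_trees, learning_rate):
    # Toy surrogate for "train a model and score it on a validation set";
    # this fake score simply peaks around n_trees=200, learning_rate=0.1.
    return 1.0 - abs(n_trees - 200) / 1000 - abs(learning_rate - 0.1)

random.seed(42)
search_space = {
    "n_trees": list(range(50, 501, 50)),
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
}

best_score, best_params = float("-inf"), None
for _ in range(20):  # 20 random trials
    params = {name: random.choice(values)
              for name, values in search_space.items()}
    score = validation_score(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

Real AutoML tools layer smarter strategies on top (Bayesian optimization, early stopping, bandits), but the loop — propose a configuration, evaluate it, keep the winner — is the same.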
This makes AutoML interesting enough for us at Itility to dive a bit deeper into a number of AutoML tools.
Firstly, we ‘threw’ the tabular data of one of our hackathons (Habesha sales activations: predict sales based on historical data) into GoogleML, H2O AutoML, H2O Driverless AI, IBM Watson studio, and Amazon AutoGluon. This delivered quite some cool results: every AutoML tool scored well enough to have finished 2nd in our hackathon. In other words, two hours of running data through AutoML yields results similar to a team of data scientists working their magic during two hours of hackathon pleasure.
Secondly, we used the data of our hackathon with WUR data (Wageningen University) on classifying normal/abnormal tomato seedlings to test the tools with image data. Our data set has 718 normal and 194 abnormal tomato plant images. The AutoML tools GoogleML, IBM Watson studio, and AutoGluon scored a high accuracy of more than 85% with just two hours of training. However, these tools did not correct for class imbalance: 97% accuracy on normal seedlings, but only 55% on abnormal ones – meaning a model that simply predicts ‘normal’ all the time would already get a high score.
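The imbalance effect is easy to quantify from the class counts above. A small Python check, using balanced accuracy (the average of the per-class accuracies) as one common metric that exposes the problem:

```python
# Class counts from our tomato seedling data set
n_normal, n_abnormal = 718, 194
total = n_normal + n_abnormal

# A "model" that always predicts 'normal' gets every normal image right
# and every abnormal image wrong, yet still scores a decent overall accuracy.
majority_accuracy = n_normal / total
print(f"always-predict-normal accuracy: {majority_accuracy:.1%}")  # 78.7%

# Balanced accuracy averages the per-class accuracies instead.
# Using the per-class scores the AutoML tools reached:
acc_normal, acc_abnormal = 0.97, 0.55
balanced_accuracy = (acc_normal + acc_abnormal) / 2
print(f"balanced accuracy: {balanced_accuracy:.1%}")  # 76.0%
```

This is why a raw accuracy above 85% on this data set says little by itself: any metric that ignores the minority class rewards the majority-class shortcut.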
We also evaluated the tools on additional criteria: ease of use, input flexibility, output flexibility, deployment flexibility, cost, and security. Feel free to contact us if you are interested in these scores as well.
Our summary of the deep dives: AutoML is good for rapid prototyping – to test the waters and see whether an idea is feasible with the data you have at hand. Of all the tools available, no single one outperforms the others – yet. AutoML is still maturing and developing rapidly, so we are keeping an eye on those developments in order to use the right AutoML tool for model prototyping.