Hackathon on improving seedling classification performance through advanced feature engineering

Written by Itility | Feb 7, 2020 7:49:51 AM

In this first hackathon of 2020, a group of Itilians, data science students, and other data enthusiasts came together to sharpen their data science skills. The topic: improve the accuracy of the current classification model used in a project done with Wageningen University & Research (WUR). The model is used to automate visual inspection of tomato seedlings, so a highly accurate model is crucial.

Improving the prediction accuracy of 90%

We kicked off with an introduction to the dataset and the current Itility model based on a convolutional neural network (CNN) in combination with transfer learning. This model was already able to predict seedling quality with 90% accuracy.

The hackathon focused on improving the previous classification accuracy through feature extraction using PlantCV (an open-source plant image analysis library) and a tree-based machine learning model (XGBoost). The group was split into teams with varying backgrounds and levels of experience. Everyone received a working Python notebook, which included code that handled the data loading, example feature extraction, and an XGBoost model to make the classifications based on these features. Each team was then tasked with improving the classification further, using more advanced feature engineering and other smart ideas of their own.
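To give a sense of what "feature extraction" means here, below is a minimal sketch of turning an image into a handful of numbers that a tree-based model can work with. It is not the hackathon notebook and does not use the PlantCV API: the image is a plain nested list of RGB tuples, and the `is_green` rule, the `margin` value, and the feature names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0-255
Image = List[List[Pixel]]      # rows of pixels


def is_green(pixel: Pixel, margin: int = 20) -> bool:
    """Assumed rule: a pixel counts as 'green' when its G channel
    exceeds both R and B by at least `margin`."""
    r, g, b = pixel
    return g > r + margin and g > b + margin


def extract_features(image: Image) -> Dict[str, float]:
    """Summarize an image as a small dictionary of numeric features."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    if n == 0:
        raise ValueError("image has no pixels")
    green_count = sum(1 for p in pixels if is_green(p))
    return {
        "green_fraction": green_count / n,
        "mean_red": sum(p[0] for p in pixels) / n,
        "mean_green": sum(p[1] for p in pixels) / n,
        "mean_blue": sum(p[2] for p in pixels) / n,
    }


# Toy 2x2 image: two leaf-like pixels and two soil-like pixels (made-up values)
toy_image = [
    [(30, 160, 40), (90, 70, 50)],
    [(25, 140, 35), (100, 80, 60)],
]
features = extract_features(toy_image)
```

Each image then becomes one row of such features, and the classifier is trained on the resulting table instead of on raw pixels.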

After a short tour of the notebook, and once everybody had logged in to the Databricks environment, the teams were ready to start hacking. The first thing most teams did was explore the notebook and identify the quick wins. Surprisingly, this took a fair amount of time. To recharge our brains, the teams decided to take a short break and enjoy a nice meal and each other’s company.

Upon returning, many of the teams decided to split the work between their members. Some focused on engineering more advanced features from the images, while others went on to tune the parameters of the XGBoost model. Advanced features were extracted from the data by gauging the degree of coloring using top-view images of the tomato seedlings or by measuring the height of the plant using side-view images. Parameter tuning was done with a grid search over parameters such as the learning rate, using the validation accuracy to pick the best setting.
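The grid search idea can be sketched in a few lines: try every combination of candidate values and keep the one with the highest validation accuracy. This is a generic sketch, not the code the teams used. The `evaluate` callback is a hypothetical placeholder for "train the XGBoost model with these parameters and score it on the validation set", and the stand-in scoring function below returns made-up numbers purely so the example runs on its own.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(
    param_grid: Dict[str, List[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination in `param_grid` and return the best one.

    `evaluate` is expected to train a model with the given parameters and
    return its accuracy on a held-out validation set.
    """
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_accuracy(params: Dict[str, Any]) -> float:
    """Stand-in scorer with invented numbers; a real version would train
    a model and measure accuracy on validation data."""
    return (
        0.95
        - abs(params["learning_rate"] - 0.1)
        - 0.01 * abs(params["max_depth"] - 4)
    )


# Candidate values are illustrative, not the ones used at the hackathon
grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 4, 6]}
best_params, best_score = grid_search(grid, fake_validation_accuracy)
```

Because the number of combinations grows multiplicatively with every extra parameter, a small grid like this one (nine combinations) is about what fits in a single evening.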

Feature engineering pays off

Once the teams had finished their tweaked versions, it was time to present the models. It was great to see the different ideas other teams came up with. For instance, some presented solutions to handle the class imbalance, such as oversampling the minority class, while others focused solely on the main task of feature engineering, either by concentrating on the extracted features themselves, such as color frequencies in the leaves and stem area, or by using images from other angles (i.e., side-view images). This latter approach, used by multiple teams, proved to be the most effective. When the features extracted from those images were included in the XGBoost model, the accuracy improved significantly, from 94 to 97 percent.
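Oversampling the minority class, one of the ideas mentioned above, can be sketched as follows: randomly duplicate examples of the rarer classes until every class is as large as the biggest one. This is a minimal illustration of random oversampling in general, not any team's actual code; the function name, the `good`/`bad` labels, and the toy feature values are assumptions.

```python
import random
from collections import Counter
from typing import Any, List, Sequence, Tuple


def oversample_minority(
    rows: Sequence[Any], labels: Sequence[Any], seed: int = 0
) -> Tuple[List[Any], List[Any]]:
    """Randomly duplicate rows of smaller classes until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        candidates = [row for row, y in zip(rows, labels) if y == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(candidates))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: four "good" seedlings and one "bad" one (made-up feature values)
rows = [[0.80], [0.70], [0.90], [0.75], [0.20]]
labels = ["good", "good", "good", "good", "bad"]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

Oversampling should only be applied to the training data; duplicating rows in the validation set would distort the accuracy being measured.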

The key takeaway of the evening was that domain knowledge plays an important role when training machine learning models. The combination of feature extraction from the plant images using PlantCV and the XGBoost model outperformed the CNN-based model used earlier, as became evident when the accuracies were compared. With the highest accuracy at 97.2%, the team using this approach won eternal fame and dinner at our Itility home base: restaurant La Fontana, another place where they surely take their tomatoes seriously.

Excited for the next hackathon? Sign up for the Meetup group to get your invite.
