
Identifying plant species using data science

Written by Itility | Oct 26, 2018 3:42:00 PM

February 7 marked the first meetup of 2018 in the long series of data science hackathons at Itility. The first challenge of the year of course had to be special. It’s like planting a seed in the minds of our data scientists, hoping that it will grow into cutting-edge ideas, new skills, and exciting projects. And we had a dataset that would serve this purpose perfectly!

The theme of this hackathon was based on Itility’s collaboration with Wageningen University (WUR) on assessment models that measure the quality of seeds by testing the seedlings. Currently this process is mainly done by hand, meaning that every seedling needs to be looked at by a human: a tedious and costly task. We collaborate to automate this process by means of machine learning.

So, this hackathon was about image recognition. Let’s take it from the top:

First, the data:

Every team received two sets of seedling images: the first contained labeled training images, the second unlabeled test images. Given the time constraints, we eased the task by providing not only the raw data but also processed data: the training images converted to matrices of red, green, and blue (RGB) values, as well as to matrices of grey values. For the plant experts among our readers, the species were: Common Chickweed, Fat Hen, Loose Silky-bent, Scentless Mayweed, and Small-flowered Cranesbill.
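For readers who want to play along at home, converting an image into such matrices takes only a few lines of Python. The file name and the 128×128 size below are placeholders of our own, not the actual hackathon files:

```python
# A minimal sketch of turning one seedling photo into the matrices described above.
# "seedling_001.png" and the 128x128 size are placeholders, not the hackathon data.
import numpy as np
from PIL import Image

img = Image.open("seedling_001.png").resize((128, 128))

rgb_matrix = np.asarray(img.convert("RGB"))   # shape (128, 128, 3): red, green, blue values
grey_matrix = np.asarray(img.convert("L"))    # shape (128, 128): a single grey channel

print(rgb_matrix.shape, grey_matrix.shape)
```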

         

[Images: one example seedling per species – Common Chickweed, Fat Hen, Loose Silky-bent, Scentless Mayweed, and Small-flowered Cranesbill]

At first glance they look quite distinct; however, at the population level the differences can be less obvious.

Second, the challenge:

The challenge was to develop an image recognition model able to identify the plant species in each of the test images. And so the coding could begin!

Most of the participants chose to use TensorFlow with the Keras library, a very common combination for building neural networks and classifying images. Keras is a neural-network API (https://keras.io/) that can be used from both Python and R and allows fast experimentation – a much-needed feature in this challenge.
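To give an idea of what that looks like in practice, here is a minimal Keras sketch of such a classifier. The 128×128 input size and the layer sizes are our own illustrative choices, not a prescription from the challenge:

```python
# Minimal Keras convolutional classifier for the five seedling species.
# Input size and layer sizes are illustrative assumptions, not the teams' exact setups.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),   # one output per plant species
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```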

However, when time is limited and the stakes are high (as usual, there was a prize!), it’s tempting to take some extra shortcuts to get ahead of the others.

    1. Simplify – The obvious shortcut was to work with the RGB matrices. A team that chose this approach used a Random Forest but, rather than using information from all three channels, focused on a single one at a time. Surprisingly, they got the best results when classification was based on the blue channel rather than the green one. However, with an accuracy of only 54%, that turned out not to be good enough to earn them a winning spot (rough code sketches of all three shortcuts follow below the list).

    2. Generate more data – there were quite a number of images in the data set, but some teams thought that this might not be enough to develop an accurate model. Those teams resorted to a method called data augmentation, which generates “additional” images by flipping, rotating, and rescaling the existing ones. While that did indeed improve accuracy, it turned out to be unnecessary to win.

    3. Build on other models – this is called transfer learning and allows reusing pre-trained models for the problem at hand. Several pre-trained models of varying complexity are available (including Google’s Inception V3 model); all it took was selecting one compatible with the provided dataset and the teams were ready to go. And indeed, with some fine-tuning, this approach paid off and those teams were ahead in the challenge.
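First, a rough sketch of the “simplify” shortcut: a Random Forest trained on just the blue channel. The random arrays below are stand-ins for the real seedling matrices and species labels:

```python
# Sketch of the "simplify" shortcut: a Random Forest on a single color channel.
# X_rgb and y are random placeholders; in the hackathon they would hold the
# seedling RGB matrices and their species labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_rgb = rng.integers(0, 256, size=(200, 64, 64, 3), dtype=np.uint8)  # placeholder images
y = rng.integers(0, 5, size=200)                                     # placeholder labels

blue = X_rgb[:, :, :, 2].reshape(len(X_rgb), -1)   # keep only the blue channel, flatten
X_train, X_val, y_train, y_val = train_test_split(blue, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```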
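Second, the augmentation shortcut. In Keras this is typically done with ImageDataGenerator; the folder name below is hypothetical:

```python
# Sketch of data augmentation: generating "additional" images by flipping,
# rotating, and rescaling the existing ones. The directory name is hypothetical.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=40,        # random rotations up to 40 degrees
    zoom_range=0.2,           # random rescaling
    horizontal_flip=True,
    vertical_flip=True,
    rescale=1.0 / 255,
)

train_generator = augmenter.flow_from_directory(
    "train/",                 # hypothetical folder with one subfolder per species
    target_size=(128, 128),
    batch_size=32,
    class_mode="categorical",
)
# model.fit(train_generator, ...) would then see slightly different images every epoch.
```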
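And third, transfer learning with a pre-trained Inception V3 base, frozen and topped with a small classifier for the five species. Input size and top-layer sizes are again illustrative choices:

```python
# Sketch of transfer learning: Google's Inception V3 with ImageNet weights,
# topped with a small classifier for the five seedling species.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(139, 139, 3))
base.trainable = False                      # freeze the pre-trained convolutional base

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),  # one output per species
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# After training the new top layers, a few of the deepest Inception blocks can be
# unfrozen and fine-tuned with a low learning rate.
```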

In the end the winning team used a rather simple 8-layer AlexNet model – but managed to reach an accuracy of 97% on the unlabeled dataset! And here is an interesting detail: not only did this team obtain the highest accuracy, they were also the only ones not using R or Python, but MATLAB.
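We do not have the winning team’s MATLAB code, but an AlexNet-style network – five convolutional layers followed by three dense layers, eight learned layers in total – roughly translates to the Keras sketch below. The layer sizes follow the classic AlexNet paper, not necessarily the team’s configuration:

```python
# Rough Keras approximation of an 8-layer AlexNet-style network (5 conv + 3 dense).
# Sizes follow the original AlexNet paper, not the winning team's MATLAB model.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])
model.summary()
```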

Overall, another evening of fun competition! But it’s worth noting that it was not always straightforward: many of the groups struggled with overfitting, for example. The best way to keep it under control was to watch out for “overfitting bears”: closely monitor the decay in loss and validate the model using a separate validation set.
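In Keras terms, that boils down to something like the snippet below: hold out part of the training data for validation and stop training once the validation loss stops improving. Here model, X_train, and y_train are assumed to come from one of the approaches above:

```python
# Sketch of keeping overfitting in check: monitor validation loss and stop early.
# `model`, `X_train`, and `y_train` are assumed to come from the sketches above.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # watch the validation loss, not the training loss
    patience=5,                  # allow a few epochs without improvement
    restore_best_weights=True,   # roll back to the best-performing weights
)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,        # hold out 20% of the training data for validation
    epochs=50,
    callbacks=[early_stop],
)
```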

Upcoming hackathon

Every few weeks we host a data science hackathon at Itility. We'll have data scientists talking about topics like machine learning, predictive analytics, and visualization techniques. But more importantly: not just words, but real action. During the meetups we use all kinds of tools (Splunk, R, Azure ML Studio, etc.) to play around with these topics and really create something tangible as a joint team effort.

So: no theory, only practical examples and the opportunity to learn from each other. Food and drinks included, of course! Sign up for the next hackathon.

Want to be notified of upcoming hackathons? Leave your email in the form.