
Splitting and classifying car data

Written by Daniel Koops | Apr 18, 2018 2:35:53 PM

Driving a car is a multitasking skill. It requires monitoring the movement of other cars, watching for changing weather conditions, and controlling the vehicle itself. Even though for many of us it feels like second nature, its complexity becomes apparent when you try to teach it to a machine.

With the rapid developments in automated driving assistance and fully autonomous cars, there is a strong need for realistic testing conditions and simulation scenarios. TNO is one of the leading institutions in the Netherlands providing such solutions.

To do so, TNO collects multivariable data from a fleet of cars driving in various conditions and settings, and analyzes it to develop simulation scenarios. This includes information about car movement, road conditions, and objects appearing on the road. It is a huge amount of data, so the analysis cannot depend on manual work: it needs to be modeled so it can run automatically.

This is an exciting but also challenging task, on which TNO data scientists and Itility data scientists work together to create the best algorithm. We also decided to pick the brains of our hackathon participants on this challenge.

Participants were given a small subset of publicly available data, similar to the data collected by TNO. It contained velocity values recorded over 4.5 hours at a sampling frequency of 12 Hz. The task was to detect activities in the data relating to six categories (slow acceleration, fast acceleration, cruising, deceleration, braking, and stop) and to separate them into individual shapelets. The criteria for each of the six activities were not defined, so teams could set them based on their own findings.
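
To give a sense of scale: at 12 Hz, 4.5 hours of driving is roughly 194,000 velocity samples. A minimal loading sketch could look like this (the file name, column name, and units are hypothetical, not the actual format of the hackathon data):

```python
# Minimal sketch of loading the velocity trace. The file name, column name,
# and units are assumptions for illustration, not the actual data format.
import numpy as np
import pandas as pd

SAMPLE_RATE_HZ = 12  # the hackathon data was sampled at 12 Hz

df = pd.read_csv("velocity.csv")          # hypothetical input file
velocity = df["velocity"].to_numpy()      # assumed column name, velocity in m/s
time_s = np.arange(len(velocity)) / SAMPLE_RATE_HZ

print(f"{len(velocity)} samples, roughly {time_s[-1] / 3600:.1f} hours of driving")
```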

That doesn't sound too difficult, right?

Well… there are some quirks in the data set. First of all, it contains a lot of low-level noise, which makes calculating the first derivative (an obvious choice for determining changes in movement) rather difficult. Secondly, applying too much filtering can reduce sensitivity and result in missing some of the events.
Knowing all that, teams were ready to go.
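
Before looking at the teams' approaches, here is a quick illustration on synthetic data (not the hackathon set) of why the raw first derivative struggles:

```python
# Illustration of the noise problem on synthetic data: the first derivative of
# a noisy velocity trace is dominated by the noise, not by the true acceleration.
import numpy as np

rng = np.random.default_rng(0)
fs = 12.0                                      # sampling frequency in Hz
t = np.arange(0, 60, 1 / fs)                   # one synthetic minute of driving
true_v = np.clip(t, 0, 30)                     # accelerate at 1 m/s^2, then cruise
noisy_v = true_v + rng.normal(0, 0.3, t.size)  # add low-level measurement noise

raw_accel = np.gradient(noisy_v, 1 / fs)       # derivative of the unfiltered signal
print("std of raw derivative:", round(float(raw_accel.std()), 2), "m/s^2")
print("while the true acceleration never exceeds 1 m/s^2")
```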

Every team decided to start with smoothing - a wise choice, given the noisiness of the data. While everyone agreed it was needed, the actual method differed for almost every group and included: a mean over a certain number of points, a moving average, a Gaussian filter, and a Savitzky-Golay filter. The last of these works by fitting successive subsets of adjacent data points with a low-degree polynomial, which increases the signal-to-noise ratio without significantly distorting the signal.
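
In Python, those filters might look roughly like this; the window lengths, sigma, and polynomial order below are illustrative guesses, not the parameters the teams actually used:

```python
# Sketch of the three most-mentioned smoothing filters. Window sizes, sigma,
# and the polynomial order are illustrative guesses, not the teams' parameters.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import savgol_filter

def moving_average(x, window=25):
    """Rolling mean over `window` samples (about 2 seconds at 12 Hz)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def smooth_all(velocity):
    """Apply the three smoothing variants to the same velocity trace."""
    return {
        "moving_average": moving_average(velocity, window=25),
        "gaussian": gaussian_filter1d(velocity, sigma=6),
        "savgol": savgol_filter(velocity, window_length=25, polyorder=3),
    }
```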

Notably, only one of the teams decided to explore a different approach and tried BaySeg, an unsupervised spatial segmentation method implemented in Python. Even though it is advertised as easy to use, the method required a lot of debugging and by the end of the hackathon did not yield a result.

Others, once done with smoothing, moved on to the next step, which was detecting individual events and dividing them into shapelets. Here, most of the participants relied on calculating local minima and maxima or the first derivative.
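
A hedged sketch of that derivative-based idea is shown below; the thresholds are arbitrary examples, not criteria any team reported:

```python
# Sketch of derivative-based labelling: classify each sample by its smoothed
# acceleration, then cut the trace into shapelets wherever the label changes.
# The thresholds are illustrative, not the teams' actual criteria.
import numpy as np

def label_samples(smoothed_v, fs=12.0,
                  stop_v=0.5, slow_acc=0.5, fast_acc=1.5, brake=-2.0):
    accel = np.gradient(smoothed_v, 1.0 / fs)
    labels = np.full(smoothed_v.shape, "cruising", dtype=object)
    labels[accel >= fast_acc] = "fast acceleration"
    labels[(accel >= slow_acc) & (accel < fast_acc)] = "slow acceleration"
    labels[(accel <= -slow_acc) & (accel > brake)] = "deceleration"
    labels[accel <= brake] = "braking"
    labels[smoothed_v < stop_v] = "stop"
    return labels

def split_into_shapelets(labels):
    """Yield (start_index, end_index, label) for each run of equal labels."""
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], change, [len(labels)]))
    return [(s, e, labels[s]) for s, e in zip(bounds[:-1], bounds[1:])]
```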

Two notable standouts, and the main contenders for the win, were a group that plotted the data and derived categories from the peaks in the histogram, and a group that split the signal into identically sized chunks of 5 seconds and categorized these based on the differences between them. The latter approach required some additional validation with the silhouette method for clustering, but it was nonetheless an inspiring idea.
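
The fixed-chunk idea can be sketched roughly as follows; the use of k-means and of these particular per-chunk features is an assumption for illustration, not the team's exact method:

```python
# Rough sketch of the fixed-chunk approach: cut the trace into 5-second windows,
# describe each window with a few simple features, cluster the windows, and use
# the silhouette score to pick the number of clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def chunk_features(velocity, fs=12.0, chunk_s=5.0):
    """Per-chunk features: mean velocity, mean acceleration, acceleration spread."""
    n = int(fs * chunk_s)
    n_chunks = len(velocity) // n
    chunks = velocity[: n_chunks * n].reshape(n_chunks, n)
    accel = np.gradient(chunks, 1.0 / fs, axis=1)
    return np.column_stack([chunks.mean(axis=1), accel.mean(axis=1), accel.std(axis=1)])

def best_clustering(features, k_range=range(2, 9)):
    """Try several cluster counts and keep the one with the best silhouette score."""
    scored = []
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scored.append((silhouette_score(features, labels), labels))
    return max(scored, key=lambda s: s[0])  # (best score, its labels)
```

The silhouette score measures how tight and well separated the resulting clusters are, which makes it a sensible check of whether the 5-second chunks really fall into distinct groups.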

In the end, the audience gave the biggest applause to the histogram solution, making that team the winner of the challenge!

Want to join next time? Sign up for our Meetup group.