Hackathon on symbolic regression
This time I challenged attendees to create a model that will predict housing prices in Boston using symbolic regression.
First, the data – the Boston housing data was collected in the 1970s and consists of 14 variables: 13 features that may influence housing prices in Boston, plus the median home value itself. The features include the number of rooms, the share of lower-status population, and the nitric oxide concentration.
Second, the challenge – our goal was to create a predictive model for housing prices in Boston and to come up with a strategy to improve living conditions based on this model. The challenge was to use as few of the variables as possible and still get a highly predictive model. The competition winner was selected based on the R-squared (with a penalty for using multiple variables) and the quality of the strategy proposed for improving housing conditions.
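To make the scoring idea concrete, here is a minimal sketch of an R-squared with a penalty for extra variables. The exact penalty used at the hackathon is not stated here, so the standard adjusted R-squared is used as a stand-in, and the two scores being compared are made-up numbers for illustration only.

```python
def r_squared(y_true, y_pred):
    """Plain R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-squared: shrinks r2 as more features are used.

    Assumption: this stands in for the hackathon's penalty, which is not specified.
    """
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical scores: a 2-variable model vs. a 13-variable model on 506 rows.
small_model = adjusted_r_squared(0.700, n_samples=506, n_features=2)
large_model = adjusted_r_squared(0.705, n_samples=506, n_features=13)
print(small_model > large_model)
```

With a penalty like this, a slightly less accurate model can still come out ahead if it needs far fewer variables.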
Third, the approach – we were to use a machine learning method called symbolic regression. The principle behind symbolic regression is to search for the optimal formula itself, rather than only the optimal parameter values of a fixed formula. The formula is generated by genetic programming, an evolutionary method that modifies (mutates) the structure of the function trees that represent candidate formulas.
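As a rough illustration of what "mutating a function tree" means, the sketch below represents a formula as nested tuples, evaluates it on one row of data, and replaces a random subtree. It is a toy under stated assumptions, not how any particular library implements it: the names `random_tree`, `evaluate`, `mutate` and `to_string` are hypothetical, and selection and crossover are left out.

```python
import operator
import random


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is close to zero."""
    return a / b if abs(b) > 1e-9 else 1.0


FUNCTIONS = {
    'add': operator.add,
    'sub': operator.sub,
    'mul': operator.mul,
    'div': protected_div,
}


def random_tree(n_features, depth, rng):
    """Grow a random tree: leaves are features ('x', i) or constants ('const', c)."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ('x', rng.randrange(n_features))
        return ('const', round(rng.uniform(-1.0, 1.0), 3))
    name = rng.choice(sorted(FUNCTIONS))
    return (name,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))


def evaluate(tree, row):
    """Compute the value of the formula for one row of feature values."""
    if tree[0] == 'x':
        return row[tree[1]]
    if tree[0] == 'const':
        return tree[1]
    return FUNCTIONS[tree[0]](evaluate(tree[1], row), evaluate(tree[2], row))


def mutate(tree, n_features, rng, p_replace=0.3):
    """Subtree mutation: walk down the tree and swap one node for a fresh subtree."""
    if rng.random() < p_replace:
        return random_tree(n_features, 2, rng)
    if tree[0] in ('x', 'const'):
        return tree  # reached a leaf without replacing anything
    name, left, right = tree
    if rng.random() < 0.5:
        return (name, mutate(left, n_features, rng, p_replace), right)
    return (name, left, mutate(right, n_features, rng, p_replace))


def to_string(tree):
    """Render the tree as a readable formula."""
    if tree[0] == 'x':
        return f'X{tree[1]}'
    if tree[0] == 'const':
        return str(tree[1])
    return f'{tree[0]}({to_string(tree[1])}, {to_string(tree[2])})'


# The formula X0 + X1 * 0.5, and one mutated variant of it.
example = ('add', ('x', 0), ('mul', ('x', 1), ('const', 0.5)))
print(to_string(example))
print(to_string(mutate(example, n_features=2, rng=random.Random(0))))
```

A full genetic programming run would repeat this over a whole population, keeping the formulas that fit the data best in each generation.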
And so it began! First, we had to choose a programming language. Any language was allowed, and it seems those of us who went for Python were off to a better start than the R enthusiasts. As it turned out, the symbolic regression library for R is less developed than its Python counterpart, which set the R developers back.
From there on, the teams took different approaches to tweaking their symbolic regression models. Some looked at correlations and selected the variables that could be influenced by the government or that correlated linearly with price; others looked at feature extraction. Even though most of the teams came up with a similar set of variables required for a proper prediction (number of rooms, lower status of the population), the specific models differed in length and complexity and, most importantly, in their predictive power. The model had to be accompanied by a good development strategy, and even though increasing the number of rooms per house was a strong contender, the teams went for the more practical approach of creating jobs and increasing safety.
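The correlation-based selection some teams used can be sketched as follows: compute the Pearson correlation of each candidate variable with the price and rank the variables by absolute correlation. The numbers below are made-up toy values, not taken from the Boston data, and `rank_by_correlation` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Return (name, r) pairs sorted by the strength of the linear relationship."""
    scores = [(name, pearson(values, target)) for name, values in features.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Toy data for illustration only (column names borrowed from the dataset).
price = [10, 20, 30, 40, 50]
features = {
    'RM': [4, 5, 6, 7, 8],          # rises with price
    'LSTAT': [30, 24, 20, 12, 5],   # falls as price rises
    'NOISE': [3, 1, 2, 1, 3],       # unrelated to price
}

ranking = rank_by_correlation(features, price)
for name, r in ranking:
    print(name, round(r, 2))
```

Ranking by absolute value matters here: a strongly negative correlation, like the one for the lower-status share, is just as informative as a strongly positive one.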
And there was one winner who had it all!
In the end, the main insight of this challenge was that symbolic regression allows a higher degree of freedom and gives quick insight into the influence of single variables. However, its predictive power is generally lower than that of more elaborate models.
An example Python script
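from gplearn.genetic import SymbolicRegressor
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
from sklearn.datasets import load_boston
import pandas as pd

# get data
boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
target = pd.DataFrame(boston.target, columns=['MEDV'])

# symbolic regression
function_set = ('add', 'sub', 'mul', 'div', 'sqrt', 'max', 'min')
# The original call is cut off after population_size. Passing function_set and
# leaving everything else at the gplearn defaults is an assumption.
sr = SymbolicRegressor(population_size=50000,
                       function_set=function_set)
sr.fit(data, target['MEDV'])

# check results
print(sr._program)                     # the evolved formula
print(sr.score(data, target['MEDV']))  # R-squared on the training data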
Every few weeks we host a data science hackathon at Itility. Data scientists talk about topics like machine learning, predictive analytics, and visualization techniques. But more importantly: not just words, but real action. During the meetups we use all kinds of tools (Splunk, R, Azure ML Studio, etc.) to play around with these topics and create something tangible as a joint team effort.
So: no theory, only practical examples and the opportunity to learn from each other. Food and drinks included of course! Sign up for the next hackathon.