The science of data science

Written by Kjeld Kerssemeeckers | Dec 31, 2018 10:41:43 AM

In academia, has a paper ever been published without a detailed methodology? Or without having been peer reviewed? The answer to both questions should be no. Traceability and reliability are among the golden rules of science. In the academic world, it is paramount that measurements are valid and that models, in whatever form, consistently deliver correct results. Equally important, others need to be able to verify that this is the case. These principles apply just as much to data science, although they are often overlooked there. Now that more and more companies are organizing themselves around their data, it is time to take a closer look at the traceability and reliability of machine learning models (ML models).

Machine learning tools

The machine learning industry is on the verge of exploding. The number of machine learning tools and frameworks available for every phase of the machine learning lifecycle grows every day, and so does the number of data scientists who graduate. The number one goal of a data scientist is to build useful ML models that add value to business processes and thereby help companies on their journey towards becoming data-driven, digital organizations. However, the many tools and methods for working with data and exploring ML models bring challenges in exactly the two pillars on which the academic world rests: reliability and traceability.

Traceable machine learning models

The moment we start to do more with data science, and the moment that we start embedding the outcomes of ML models into the organization, is when the need for traceable and reliable ML models increases.

After all, models must continue to function optimally once they are integrated into processes that are sometimes business critical, especially when they make automated decisions without human intervention. Yet third parties cannot always verify how an ML model was built. The focus is generally on the result: when it works, mission accomplished. As a consequence, changes made to optimize a model often go unrecorded. Sooner or later, however, the output quality of a retrained ML model deteriorates compared to previous versions, for instance when the data set becomes contaminated, or when the reality you are trying to automate changes. At that point it suddenly becomes imperative to be able to trace how the model was built, who created it, and with which tool or framework it was trained, either to implement improvements or to make outdated models fit a new reality without depending on the data scientist who created the model. Problems can only be solved when we know at what stage things went wrong.
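To make this concrete, here is a minimal sketch of what recording that information could look like. It is not taken from any particular tool: the `ModelRecord` class, its field names, and the example values are all hypothetical, and a real setup would more likely rely on a dedicated tracking platform than on hand-written records.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical provenance record for one trained model version."""
    model_name: str
    version: int
    author: str                 # who created this version
    framework: str              # e.g. "TensorFlow", "PyTorch", "Spark ML"
    training_data_sha256: str   # fingerprint of the exact training data
    hyperparameters: dict
    change_note: str            # what was changed, and why
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest so the training data can be identified later."""
    return hashlib.sha256(data).hexdigest()


def to_json(record: ModelRecord) -> str:
    """Serialize the record so it can be stored next to the model artifact."""
    return json.dumps(asdict(record), sort_keys=True, indent=2)


# Illustrative values only; none of these come from a real project.
record = ModelRecord(
    model_name="churn-classifier",
    version=3,
    author="j.doe",
    framework="PyTorch",
    training_data_sha256=fingerprint(b"customer_id,churned\n1,0\n2,1\n"),
    hyperparameters={"learning_rate": 0.01, "epochs": 20},
    change_note="Retrained on the latest quarter of data",
)
print(to_json(record))
```

Storing such a record with every retrained version means that a colleague who has never seen the model can still tell who built it, with which framework, and on which data.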

Standardization in data science

In short, the difficulty is two-sided: the proliferation of machine learning tools, and ‘the headstrong data scientists’ who are used to working in their own way. At this point no universal, standardized approach exists for building, validating, and tracing ML models, even though one is necessary to ensure quality. What happens when a model built with TensorFlow starts producing incorrect results, while the data scientist who needs to solve the issue only knows how to work with PyTorch or Spark ML? ML models are in essence not tied to a framework, since they are based on generic mathematical principles, but in practice this does create a real difficulty.

It is an issue that requires proper organization, structure, and collaboration. Like software developers, data scientists must be able to see and validate each other’s work regardless of the tools and frameworks they were trained in. It would be a shame if the embedding of valuable ML models were slowed down because not all data scientists are trained in the same tools.

Giants such as Google and Facebook are perhaps the only examples of companies that are well organized in this respect. As data-driven technology companies from the start, they were the first to run into this problem, and they had the skills and resources needed to address it: a luxury that allowed them to develop their own ways of working to ensure the quality of their ML models. For the rest of the world, which has only just started to explore data science, these aspects need to be considered from the start to prevent issues later on.

Eventually, data science will go through the same process that software development once did. I do not know a single developer who does not have their code reviewed by peers, or who is unfamiliar with version control. Software developers are taught the right way of working together while they are still in school. As a result, alongside the skill of writing code, they possess collaboration skills that are not yet a given in the world of data science. Given the speed with which companies are becoming data driven, and the pace at which new data scientists are being trained, this situation is not sustainable.

Be ready for the growth that is coming your way and be aware of what it requires from your organization. Enable a level of standardization, at least within the walls of your company, so that it is always traceable how data is used within a model and how the model itself functions, regardless of tools, frameworks, or competence. An open source machine learning platform such as MLflow is worth considering for streamlining the entire data science chain within the company. Also embrace a way of working in which each model flows from Test to Production via a pipeline that includes version control, automated data validation tests, and peer review of the code.
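As an illustration of the automated data validation step in such a pipeline, the sketch below checks incoming rows against expected value ranges before a model is retrained or promoted. The column names, the ranges, and the `validate_rows` function are assumptions made up for this example; they are not part of MLflow or any other platform.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: column name -> (minimum, maximum) allowed value.
EXPECTED_RANGES: Dict[str, Tuple[float, float]] = {
    "age": (0, 120),
    "monthly_spend": (0.0, 10_000.0),
}


def validate_rows(rows: Iterable[dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data passed."""
    errors: List[str] = []
    for index, row in enumerate(rows):
        for column, (low, high) in EXPECTED_RANGES.items():
            value = row.get(column)
            if value is None:
                errors.append(f"row {index}: missing '{column}'")
            elif not (low <= value <= high):
                errors.append(
                    f"row {index}: '{column}'={value} outside [{low}, {high}]"
                )
    return errors


# Illustrative input: the second row is deliberately contaminated.
sample = [
    {"age": 34, "monthly_spend": 59.90},
    {"age": 150},
]
problems = validate_rows(sample)
for problem in problems:
    print(problem)
```

A pipeline could run a check like this as an automated test and refuse to move a model towards Production whenever the list of problems is not empty, so that contaminated data is caught before it degrades the model.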

Well-oiled machine

As we all know, knowledge is gained through trial and error. Yet being prepared is half the battle. Look beyond the hype of machine learning, and turn data science into the well-oiled machine that it should be, regardless of tooling, framework, or competence. It is the only way to maintain the speed that is necessary to move forward.
