Starting a data project by building a data lake from scratch is a lot of work, which makes it expensive and error-prone. Building the technical foundation while under pressure to deliver data use cases encourages a short-term focus and increases the chance of mistakes that are costly in the long run, such as databases and servers being set up without thorough monitoring.
Using standardized building blocks is a way to solve this problem. It is already common practice in the cloud world, where we see a catalog approach of semi-finished products such as applications, databases, middleware, storage, compute, or networking, giving you the freedom to pick and mix your own custom solution. Why not use this approach for data projects?
Our Data DevOps team has developed many standard data building blocks. They are tried and tested, based on our experience and lessons learned from a good number of data projects. From that experience, it became clear that although the data for every project is unique, the process to yield value is surprisingly similar. As in a factory, the steps in this process have an optimal sequence. We have divided these steps into well-defined units to create standardized building blocks, enabling us to move fast in a controllable way.
This accelerates the implementation of data use cases while continuously delivering high availability and performance. Take, for example, a scalable cloud-based data processing platform to load and extract data. That platform alone contains at least four building blocks: cloud-based databases (either for storing large amounts of raw data or for rapid data I/O), data pre-processing pipelines, the data science workflow, and a security and data governance model.
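To make the pick-and-mix idea a little more concrete, here is a minimal sketch of how such a platform could be described as a composition of blocks. The block names and fields are purely illustrative, not our actual catalog.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified catalog entries -- illustrative only.
@dataclass
class BuildingBlock:
    name: str
    purpose: str
    monitoring: bool = True  # every standard block ships with monitoring by default


@dataclass
class DataPlatform:
    blocks: list = field(default_factory=list)

    def add(self, block: BuildingBlock) -> "DataPlatform":
        self.blocks.append(block)
        return self


# Composing a data processing platform out of standard blocks.
platform = (
    DataPlatform()
    .add(BuildingBlock("raw-data-store", "store large amounts of raw data"))
    .add(BuildingBlock("serving-database", "rapid data I/O for use cases"))
    .add(BuildingBlock("preprocessing-pipeline", "turn raw data into use-case data"))
    .add(BuildingBlock("data-science-flow", "experimentation and model training"))
    .add(BuildingBlock("governance-and-security", "access control and data separation"))
)

for block in platform.blocks:
    print(f"{block.name}: {block.purpose} (monitoring={block.monitoring})")
```

The point is that the platform definition is assembled from pre-built, monitored units rather than written from scratch.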
Let’s zoom in on an example: the StreetWise project. Together with TNO, we created a scenario-based database of real-life traffic situations by mining many thousands of miles of driving data with complex algorithms, to be used by companies to test software for self-driving cars. As the performance of these cars depends directly on the quality of the data, mistakes in gathering, processing, or analyzing it could have disastrous consequences.
Especially in a project like this one with TNO, where the data is used to teach self-driving cars, it is clear that there can be no compromise on quality. That is why we used the data governance and security building block when architecting the platform. By design, the sensor data and mined scenarios are stored separately from the client-facing generated test cases. The engineers and data scientists have the read/write rights on the environment they need to do their work, but those rights are restricted so they cannot accidentally break the platform. This allows the team to focus on their core responsibilities without being limited or distracted by data security.
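As an illustration of that access model, the sketch below shows a deny-by-default permission check over separate data zones. The zone and role names are hypothetical, and a real platform would enforce this through the cloud provider's access controls rather than application code like this.

```python
from enum import Enum, auto

# A simplified model of the separation described above -- the zone and role
# names are hypothetical, not the actual platform configuration.
class Zone(Enum):
    SENSOR_DATA = auto()        # raw sensor recordings
    MINED_SCENARIOS = auto()    # scenarios extracted by the mining algorithms
    CLIENT_TEST_CASES = auto()  # generated test cases exposed to clients

# Engineers and data scientists get read/write on the zones they work in,
# and nothing at all on zones they should not touch.
PERMISSIONS = {
    "engineer": {Zone.SENSOR_DATA: {"read", "write"}, Zone.MINED_SCENARIOS: {"read", "write"}},
    "data_scientist": {Zone.SENSOR_DATA: {"read"}, Zone.MINED_SCENARIOS: {"read", "write"}},
    "client_api": {Zone.CLIENT_TEST_CASES: {"read"}},
}

def is_allowed(role: str, zone: Zone, action: str) -> bool:
    """Deny by default: anything not explicitly granted is refused."""
    return action in PERMISSIONS.get(role, {}).get(zone, set())

assert is_allowed("engineer", Zone.MINED_SCENARIOS, "write")
assert not is_allowed("data_scientist", Zone.CLIENT_TEST_CASES, "write")
print("access checks passed")
```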
Other examples of building blocks in the standardized data factory are an Azure-based database that we can deploy almost instantly, data ingestion pipelines that ensure a smooth transition from raw data to use-case-specific data, and an Analysis Services server for fast access to the data sets. All these building blocks come with monitoring and performance controls. And because we use a software-defined infrastructure, we constantly keep the building blocks up to date with the latest insights. Since in DevOps the things you build (development) are also your responsibility to support (operations) – you had better get it right from the start.
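To give a feel for what such an ingestion building block does, here is a minimal, hypothetical pipeline skeleton in Python: raw records pass through a data-quality gate and a reshaping step, and the runner logs a basic metric for monitoring. The field names and conversions are invented for the example.

```python
import logging
from typing import Callable, Iterable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

# One pipeline step: records in, records out.
Step = Callable[[Iterable[dict]], Iterable[dict]]

def drop_incomplete(records: Iterable[dict]) -> Iterable[dict]:
    """Basic data-quality gate: skip records that miss required fields."""
    for record in records:
        if record.get("timestamp") is not None and record.get("speed") is not None:
            yield record

def to_use_case_schema(records: Iterable[dict]) -> Iterable[dict]:
    """Reshape raw records into the schema a specific use case expects."""
    for record in records:
        yield {"ts": record["timestamp"], "speed_kmh": round(record["speed"] * 3.6, 1)}

def run_pipeline(raw: Iterable[dict], steps: list) -> list:
    """Chain the steps and log a simple metric for monitoring."""
    data: Iterable[dict] = raw
    for step in steps:
        data = step(data)
    result = list(data)
    log.info("pipeline produced %d records", len(result))
    return result

raw_records = [
    {"timestamp": 1, "speed": 27.8},   # speed in m/s
    {"timestamp": 2, "speed": None},   # incomplete record, will be dropped
]
print(run_pipeline(raw_records, [drop_incomplete, to_use_case_schema]))
```

In the real building block the same idea sits behind managed cloud services with monitoring attached, so a broken or incomplete feed is noticed before it reaches a use case.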