Written by:  

Michel Harren

Building a data pipeline

How do you get value from your data? Don't start by building a data platform!

Data is created continuously. By people, by machines, by processes. And the rate at which it is being created grows as well. All this data could be turned into value, but where do you start on this ‘data journey’?
In short: don’t start with the technology. Start with the value.
This blog explores why jumping straight into building a data platform is the wrong approach, and offers a more effective path to unlocking the true value of your data.


Why you don't need a data platform (yet)

A data platform is an infrastructure to efficiently collect, store, manage, process, and analyze large volumes of data from diverse sources. It provides tools, services, and capabilities for data integration, storage, governance, analytics, and visualization. If you are starting your data journey, this can be a lot to deal with, making it nearly impossible to decide where to start. So, don’t start by building a full data platform. Start small, with just one use-case and one so-called “data pipeline”. Below we discuss several reasons why. As you keep adding data pipelines, your data platform will be created step by step, with minimum risk and maximum value.

There are many platforms to choose from

Data platforms come in so many different shapes and forms that it is too risky to just choose one without first having insight into where value could be created. The best way to start, in our opinion, is to work backwards. Talk to the people in your organization who are already creating insights from data. You could be surprised who they are and what they already do with their data. It could be an engineer designing your product who gathered usage patterns from your customers. Or a supply-chain analyst analyzing shop floor data to understand where the factory bottleneck is. Or a marketeer who is turning web shop user clicks into valuable insights.

These people can explain how they are creating value today, why it has value for them, and what additional value they need to get out of data. For the engineer it could mean taking more data sources into consideration. The supply chain analyst might lack a connection to your factory data system. The marketeer might need to connect to a deep-learning toolkit. Such user stories help you understand what the business needs, so you can define the requirements of the data platform. Moreover, they help you understand how the data platform could pay for itself.

You don’t know where the data comes from (yet)

Talking with the people who are already working with data teaches you more about where the data comes from, and how it flows through your organization.

You are certainly going to encounter the complexity of collecting the data, combining it, and transforming it into a form that makes it possible to run analyses. Data collection is a challenge. Data sources are spread throughout your organization, across networks, vendors, and technology stacks. Some data is derived from other data, making it unclear who the actual owner of the data is.

Knowing what connections are needed, what the actual data sources are, and who the owners of the data are, helps you understand what the platform should be capable of. It helps shape the requirements.

You don’t know how data is processed (yet)

Data platforms are not just big silos of bits and bytes. They are not just about storage and secure access. Data platforms need compute power and storage throughput as well. Understanding how data is currently processed, how it flows through the organization, and what the future needs are, helps you better understand the needs of data consumers.

Most often, the initial business value from data comes from basic visualization and dashboarding, as these are easy to create with available data and tools. However, listening to the stories of the people already processing data will create new insights. Imagine the engineer shaping your products of the future, who needs to connect data from specific mechanical or physics-based models. Or the supply chain analyst who needs image data processing to enhance insight and predictability. Or the marketeer who needs deep-learning techniques, which require much more processing power.

If these hidden requirements are not incorporated into the platform architecture from the outset, it becomes challenging to integrate them later. Gaining insight into how data is processed, now and in the future, will significantly improve your requirements list for a data platform.

You don’t know how to make data actionable (yet)

The platform itself is not the goal. The goal is to deliver insights that people can use to make decisions in their day-to-day business processes. Gathering data into a mobile phone app leads to different requirements for the platform than combining data from the shop floor for management reports with real-time scenario planning of supply and demand.

For the mobile phone app, you need to make sure the development team can leverage data through micro-services and more traditional databases. For management reports, you need to connect resource-hungry client applications with data that can be queried in milliseconds.
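
To make the micro-service route a bit more concrete, here is a minimal sketch in Python (assuming FastAPI, a SQLite database, and a hypothetical daily_kpis table; this is an illustration, not a prescription of any particular stack):

```python
# Minimal sketch: expose prepared pipeline output to an app team via a micro-service.
# Assumes a SQLite file "pipeline.db" with a hypothetical "daily_kpis" table.
import sqlite3

from fastapi import FastAPI

app = FastAPI()

@app.get("/kpis/{customer_id}")
def get_kpis(customer_id: int):
    """Return the latest prepared KPI rows for one customer."""
    with sqlite3.connect("pipeline.db") as conn:
        rows = conn.execute(
            "SELECT metric, value, updated_at FROM daily_kpis WHERE customer_id = ?",
            (customer_id,),
        ).fetchall()
    return [{"metric": m, "value": v, "updated_at": t} for m, v, t in rows]
```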

You don’t know how to make it operational (yet)

As use-cases prove to be successful, users are going to demand continuity and consistent performance. Here you learn that a platform is not just data and compute power. It needs people to run it and develop it further. It introduces new roles within your organization, such as data owners and data stewards. Do you currently have people who can debug issues with data ingestion? Who is going to update the software? Do you have the knowledge in place to develop new features?

The data platform is going to ask for a new approach, new skills, and new expertise. You want to make sure this team is in place with the correct capabilities at the right time. Otherwise, you will end up with either an unmanaged, insecure platform or with shadow IT.

You don’t have your business case (yet)

Data platforms can be quite an investment. Even though it is important to experiment and learn, building a platform without knowing where actual value will be created is a guaranteed waste of money. By starting with the (business) users and translating their needs into value to be created, you know what type and size of platform you need, and you prevent building one that is too big or that does not fit the needs of the business. This is important, as the platform will need business owners who are willing to pay for it. If you can’t show value yet, no business owner will have real commitment to fund a data journey.

 

Where to start?

So far, we have discussed why you should not start your data journey by building a data platform. But where should you start then? This is a question many decision makers struggle with in their effort to create value from data. Everyone understands that good insights can help forecast customer demand, improve quality, or predict maintenance, for example. But at the start, it is mostly unclear what exactly is needed to realize this value. So, where should you start?

Gather opportunities; get on the shop floor

Your journey should start with gathering opportunities. If you are coming from a data or IT background, this means finding the right people in the business who understand the business challenges, priorities, and opportunities. Talk to the teams that are close to the customer, and to the people involved in logistics, support, and product development.

A good starting question is to ask where they see room for improvement. Do not bring data to the table immediately. Just ask questions to understand their daily struggles. What are they doing to improve quality, increase performance, and reduce downtime or defects? What data do they lack? Then listen, stay curious, and keep asking 'why'. It might sound challenging at first, but very soon you will learn that there are a lot of opportunities floating around in your company.

The next step is to choose the opportunity that can make the most difference. Our approach is to create a ranking matrix based on six different criteria.

Ranking opportunities

After gathering opportunities, a choice should be made on which opportunity to start with; in other words, which data pipeline will be built first. This decision should be made with all stakeholders, so that there is commitment for the way forward and to prevent everyone from building their own data silo. The ranking should be as objective as possible. Below we discuss criteria for ranking opportunities.

Express value in numbers

To make ranking as objective as possible, the value of an opportunity should be as specific as possible, including numbers or percentages. “Higher quality” or “fewer defects” is not specific enough. Be as precise as possible. Can you reduce downtime for your customer? Define how much it currently is and how much it could be reduced. What percentage of machine failures could you have prevented last year? How much money would that have saved? How much more time would your colleagues have had for other work, and what is the value of that time? Describing the value in numbers helps in comparing value across many opportunities.
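
For example (with purely hypothetical numbers): if 20 of last year’s 100 machine failures could have been prevented and each failure cost €5,000 in downtime, that opportunity is worth roughly €100,000 per year, a figure you can directly compare against other opportunities.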

Data engineering and data science resources

Building a data pipeline comes at a cost. A future-proof architecture requires considerable time from data engineers and data scientists. The value to be created should obviously outweigh these costs. Opportunities that benefit from a repeatable process, embedded in the organization and using their data pipeline every day, get a higher score than those that are more ad hoc or case-based.

Data availability and flow

Without actual, good-quality data to build a pipeline and model, there is no point in starting. Not all data is easily accessible or of good quality. Some data could be outdated, duplicated, or from an untrusted source. Availability of good-quality data is an important criterion.

A lack of good quality data doesn’t have to be a problem when ranking an opportunity, because once you know you need data, you can work on making it available. But when you want to start soon and get results quickly, at least some (good quality) data should be available.

Also, the ease of data flow should be assessed. Imagine having to fetch unstructured Excel files from someone's email folder every day as a data source. Without a clear and standard data inflow, you can’t build a reliable solution that creates value.

Availability of domain experts

Domain experts understand the business. They have customer insights, understand challenges on the shop floor, and can assess the correctness of conclusions drawn from data. The team building data pipelines needs easy access to your domain expert(s), preferably as part of a multi-functional team. This way, they can immediately ask questions, interpret the data together, and quickly review results, which leads to results faster. Therefore, opportunities for which a domain expert is available will rank higher.

Feasibility of implementation

To assess the feasibility of opportunities, a team of engineers, scientists, and domain experts sits together and discusses the relevant opportunities. If there are too many hurdles to quickly show results, this is an important dissatisfier and the opportunity should be ranked lower. Sometimes the monetary value is great, but if feasibility is low early in your data journey, not being able to prove value quickly drains team motivation and stakeholder enthusiasm.

Scalability and operationalization

The more scalable and repeatable an opportunity is, the more value an investment creates over time. Predicting maintenance is a daily process that needs a data pipeline continuously. Thus, such a solution should be ranked higher, as it can be embedded into daily operations. This seems obvious, but there will be plenty of cases that only need a one-off data analysis.

Opportunities that need a data pipeline which can be reused for other opportunities, rank higher because the investment in the data pipeline can be spread over several opportunities.
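
As an illustration of such a ranking matrix, here is a minimal sketch in Python; the opportunity names, scores, and weights are hypothetical and should come out of the stakeholder discussion:

```python
# Minimal sketch of a ranking matrix: score each opportunity 1-5 on the six
# criteria above and compute a weighted total. Names, scores, and weights
# are hypothetical examples.
CRITERIA_WEIGHTS = {
    "value_in_numbers": 0.30,
    "engineering_effort": 0.15,   # higher score = less effort required
    "data_availability": 0.20,
    "domain_experts": 0.10,
    "feasibility": 0.15,
    "scalability": 0.10,
}

opportunities = {
    "predictive_maintenance": {"value_in_numbers": 4, "engineering_effort": 2,
                               "data_availability": 3, "domain_experts": 5,
                               "feasibility": 3, "scalability": 5},
    "webshop_click_insights": {"value_in_numbers": 3, "engineering_effort": 4,
                               "data_availability": 5, "domain_experts": 4,
                               "feasibility": 5, "scalability": 3},
}

def weighted_score(scores: dict) -> float:
    """Combine the six criterion scores into one comparable number."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

ranking = sorted(opportunities.items(),
                 key=lambda kv: weighted_score(kv[1]), reverse=True)
for name, scores in ranking:
    print(f"{name}: {weighted_score(scores):.2f}")
```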

Discuss the outcome and choose

After ranking based on these criteria, the outcome is best presented in a diagram like the one shown below, which should then be discussed with the relevant stakeholders.

DataPipeline_Image_02-1

Becoming data driven can have a significant (positive) impact on your organization. But it also means the way of working will change. Therefore, discussing the ranking with all relevant stakeholders is important. Real commitment not only depends on a good solution, but also on the acceptance by stakeholders. Involving them in all steps of the data journey significantly increases the chance of success.

With this approach you will have taken an important step in your data journey. The next step is to execute and show the first actual value.

 

Building a data pipeline

DataPipeline_Image_03

When starting a data journey, it is difficult to plan all steps. There are many unknowns that you only uncover once you have started the journey. Best practice is to get started by creating a schematic representation of the data pipeline and marking the unknowns, and then developing the pipeline step by step. When some parts are risky or unclear, start by removing that uncertainty and keep going.

In some cases, this uncertainty comes from the data quality (incomplete, difficult structure, or binary formats). In other cases, it is a proprietary technology (e.g. domain-specific data modelling tools). For all such issues there is a solution, but you should not deal with them all at once.

Create an MVP

The schematic processing pipeline serves as a to-do list and a progress overview. Depending on the size of the team, people work in parallel on several aspects of the pipeline. One team member works on the raw data ingest from other systems, while another works with a first data dump (the floppy in the visual) to transform the data. The goal of the first step, or sprint, is to show that all steps work separately, including a first demo to end users: a Minimum Viable Product (MVP).

DataPipeline_Image_04
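
As a minimal sketch of what these first MVP steps could look like (file name, columns, and cleaning logic are hypothetical), two standalone Python functions are enough to get started, each testable on its own:

```python
# Minimal MVP sketch: ingest and transform as two standalone, testable steps.
# "machine_export.csv" and its columns are hypothetical placeholders for a
# first data dump from a source system.
import csv
from pathlib import Path

def ingest(raw_path: Path) -> list[dict]:
    """Read a raw data dump into memory; in the MVP this is just a CSV file."""
    with raw_path.open(newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Keep only valid measurements and convert the value column to a number."""
    cleaned = []
    for row in rows:
        try:
            value = float(row["value"])
        except (KeyError, ValueError):
            continue  # skip incomplete or malformed rows for now
        cleaned.append({"machine_id": row["machine_id"], "value": value})
    return cleaned

if __name__ == "__main__":
    rows = ingest(Path("machine_export.csv"))
    print(f"Transformed {len(transform(rows))} of {len(rows)} raw rows")
```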

Having the first parts of the pipeline working gives the team the right boost of confidence that things can be done. Serious effort and optimism are still needed to create a fully operational workflow, but after one or two weeks it should be possible to give a first demo to the (business) stakeholders, showing that progress is being made and that value can be generated. It makes the potential value more tangible.

Connecting the dots

The next step is to connect all the separate parts. With the data ingestion working, the results can be put in the right place (e.g. a database or a file system). The code that was developed to transform the data dump can be connected to the ingested data. Generally, this is a minor step. But with real (or new) data coming in, the model should be (re)tested with the end user: does the user recognize the insights based on the real data?

DataPipeline_Image_06
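
Connecting the parts can be as simple as chaining the steps and persisting the results in a shared store. A minimal sketch, reusing the hypothetical ingest() and transform() functions from the MVP sketch and a SQLite database as the "right place" for the results:

```python
# Minimal sketch: connect the MVP steps through a shared SQLite database.
import sqlite3
from pathlib import Path

from mvp_pipeline import ingest, transform  # hypothetical module from the MVP sketch

def run_pipeline(raw_path: Path, db_path: str = "pipeline.db") -> None:
    """Ingest a raw file, clean it, and store the result where consumers can query it."""
    cleaned = transform(ingest(raw_path))
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS measurements (machine_id TEXT, value REAL)"
        )
        conn.executemany(
            "INSERT INTO measurements (machine_id, value) VALUES (?, ?)",
            [(r["machine_id"], r["value"]) for r in cleaned],
        )
```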

After a demo, stakeholders sometimes think ‘it is almost ready’. Even though enthusiasm is important for the success of the project, it is also important to note that there are still several steps necessary to create an operational workflow that is reliable and repeatable without manual work.

Making it reproducible

An operational workflow must cope with a constantly changing context. Data changes, requirements change, and users want new insights. This is manageable by automating deployment from a source code repository using Continuous Integration and Deployment (CI/CD).

DataPipeline_Image_07-1

With CI/CD, new code is continuously developed and deployed. Sometimes for new functionality, sometimes for necessary improvements such as making the solution more resilient or more secure. To reduce the effort spent on such improvements, freeing up time for new functionality, the best practice is to use standard building blocks and guidelines when architecting the platform. These could be data governance and security building blocks that guarantee the engineers and data scientists have the read/write rights on the environment to do their work but cannot accidentally break the platform. This allows the team to focus on their core responsibilities, without being limited or distracted by data security.

Other examples of building blocks could be cloud-based databases that can be deployed almost instantly; data ingestion pipelines that ensure a smooth transition from raw data to use-case-specific data; or an Analysis Service server for fast, secure access to the data sets. All these building blocks come with standard monitoring and performance controls, which eliminates the need to repeatedly design a solution for these basics.

After new code is developed and fully tested, it is deployed to the data platform. As soon as the developed code is deployed, the development loop is closed, leading to a new version of the data pipeline. Any change to the software results in a new version of the production system. Older versions of code and models are kept as history, with the possibility to roll back quickly in case issues arise.
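
As a minimal sketch of the "fully tested" part, a unit test that the CI pipeline runs on every change could look like this (assuming pytest and the hypothetical transform() function from the MVP sketch):

```python
# Minimal sketch of a unit test that CI runs before every deployment.
from mvp_pipeline import transform  # hypothetical module from the MVP sketch

def test_transform_drops_malformed_rows():
    rows = [
        {"machine_id": "M1", "value": "12.5"},
        {"machine_id": "M2", "value": "not-a-number"},
        {"machine_id": "M3"},  # missing value column
    ]
    cleaned = transform(rows)
    assert cleaned == [{"machine_id": "M1", "value": 12.5}]
```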

Scheduling the workflow

Automatic deployment of code is an important aspect of successfully managing a reusable data pipeline. But this won’t make it run by itself. To deliver weekly, hourly, or even real-time value, every step needs to be scheduled. Generally, there are two possibilities for this: time-based scheduling or event-based scheduling (processing triggered by incoming data). Both trigger the flow of data toward insights for the user, so the pipeline creates value automatically, without manual intervention.

DataPipeline_Image_07
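
A minimal sketch of both options, using only the Python standard library (in practice a scheduler or orchestration tool would take over this role; run_pipeline() is the hypothetical function from the earlier sketch, and the "incoming" folder is a hypothetical landing zone):

```python
# Minimal sketch of time-based and event-based triggering with the standard
# library only; in production a scheduler or orchestration tool would do this.
import time
from pathlib import Path

from connected_pipeline import run_pipeline  # hypothetical module from the earlier sketch

INBOX = Path("incoming")  # hypothetical landing folder for new data dumps

def run_hourly() -> None:
    """Time-based: trigger the pipeline once per hour."""
    while True:
        run_pipeline(INBOX / "latest.csv")
        time.sleep(3600)

def run_on_new_file(poll_seconds: int = 30) -> None:
    """Event-based: trigger the pipeline whenever a new file arrives."""
    seen: set[Path] = set()
    while True:
        for path in INBOX.glob("*.csv"):
            if path not in seen:
                run_pipeline(path)
                seen.add(path)
        time.sleep(poll_seconds)
```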

Taking it fully operational

There is one more step for successfully running a data pipeline that continuously creates value for its users: monitoring. The further the pipeline is developed, the more business-critical it becomes. So continuously monitoring all parts of the pipeline for quality, completeness, and potential problems is required to create a fully operational data pipeline. In the visual, this is indicated by the cameras on top of the processing steps.

DataPipeline_Image_08
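
A minimal sketch of such a monitoring check (the table name refers to the hypothetical measurements table from the earlier sketch; thresholds and alerting are hypothetical, and in practice the results would feed a monitoring or alerting tool):

```python
# Minimal sketch of a data-quality monitor: check completeness of the pipeline
# output and raise an alert when something looks wrong.
import sqlite3

def check_pipeline_health(db_path: str = "pipeline.db") -> list[str]:
    """Return a list of alert messages; an empty list means the pipeline looks healthy."""
    alerts = []
    with sqlite3.connect(db_path) as conn:
        row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
        null_count = conn.execute(
            "SELECT COUNT(*) FROM measurements WHERE value IS NULL"
        ).fetchone()[0]
    if row_count == 0:
        alerts.append("No data ingested in the last run")
    elif null_count / row_count > 0.05:
        alerts.append(f"{null_count} of {row_count} rows have missing values")
    return alerts

if __name__ == "__main__":
    for alert in check_pipeline_health():
        print("ALERT:", alert)  # in practice: send to your monitoring or alerting tool
```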

The optimal team setup: pressure cookers

To conclude this whitepaper, we would like to share our view on the optimal team setup for data pipeline projects.

It is our belief (and experience) that the start of a data journey needs quick value. The optimal team setup is therefore a ‘pressure cooker’: a small team with a clear goal and a challenging deadline. This creates focus and a drive to succeed. Incremental steps of small successes are much more motivating than long-lasting projects with one big result at the end. A pressure cooker works on opportunities in small steps, which must prove that value can be created before further investments are made.

Our projects have taught us that a standard agile approach like Scrum does not always help to keep up speed. So, we use a mixed agile approach in which we balance results and process. We draw out the data pipeline schematically and mark progress on each ‘block’. This helps explain the necessary steps to the team and to stakeholders. It also clearly indicates what steps we need to take to get to a reliable solution that continuously adds value.