Why Does Data Orchestration Exist?

TL;DR - to centralize control over your data pipeline.


This is part one of a six-part series on ‘Simplifying Data Orchestration.’ Expertise is not demonstrated by wielding complexity, but by the ability to take a complex topic and break it down for broader audiences.

Introduction


There are different phases of topic discovery and adoption that individuals and businesses go through. The common discovery questions: who does data orchestration, what is data orchestration, when and where does it happen, why does it exist, and how do you do it? Concrete answers to these questions are hard to find when evaluating a new product, tool, or way of doing things. In this series, I will answer most of them in relation to data orchestration and Shipyard.

Starting with Why? Identifying the Pain


Before I get into what data orchestration is, it’s important to understand why it came to exist in the first place. Tool categories become tool categories because they ease a particular pain in a unique way that other tools do not. So if you’re adopting a brand-new tool that is more than an upgrade or replacement, it’s important to understand what problems it actually solves.

Story Time


As a former analyst, one of the tools in my arsenal isn't so much a tool as a skill - data storytelling. So let's dive into a realistic story to highlight the pain that drove the creation of data orchestrators in the first place. This story follows the journey of an analyst as they dive deeper and deeper into the thorny weeds of data pipeline management.

There are many tools in each category that would work in this story. I have remained tool agnostic so that you can insert your own tool of choice where relevant.

Chapter 1: DB to Dashboard


The story starts off simply. There is a dashboard connected to data extracts that power the visualizations. I'll leave the dashboard itself aside for now. The main focus is the schedule set from within the user interface of the data visualization tool, which refreshes the extracts by running SQL against the database or data platform that stores the data. Nothing too crazy, for now.
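To ground the mechanics, here is a minimal sketch of what such a scheduled extract boils down to: a query run against the database on a timer. The orders table and its columns are hypothetical, and sqlite stands in for the real data platform.

```python
import sqlite3

# Stand-in for the real data platform; the table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "west", 120.0), (2, "east", 75.5), (3, "west", 42.0)],
)

# The "extract" the BI tool refreshes on its schedule is, at heart, a query like this.
EXTRACT_SQL = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
"""

for row in conn.execute(EXTRACT_SQL):
    print(row)  # each row feeds the dashboard's visualizations
```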

Chapter 2: Transformation


In order to keep uncovering new insights, keep track of the growing number of one-off SQL queries, shape data for visual aggregation, and track metrics, the analyst adopts a new transformation tool (and a new skill!). Data transformation tools turn ad-hoc queries into named, tested, reusable models, and run those transformations automatically over large volumes of data.
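In spirit, a transformation tool turns one-off queries into named, materialized, testable models. Here is a minimal sketch of that idea, again using sqlite as a stand-in warehouse; the table names and the data test are hypothetical, and real transformation tools layer dependency management, scheduling, and documentation on top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "west", 120.0), (2, "east", 75.5), (3, "west", 42.0)],
)

# A "model": a named transformation materialized as a table for the dashboard
# to read, instead of a one-off query pasted into the BI tool.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_sales, COUNT(*) AS order_count
    FROM raw_orders
    GROUP BY region
""")

# A minimal data "test" the transformation tool would run on its own schedule.
negatives = conn.execute(
    "SELECT COUNT(*) FROM sales_by_region WHERE total_sales < 0"
).fetchone()[0]
assert negatives == 0, "data test failed: negative sales totals"
```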

Now, we have a happy analyst with well-defined business metrics and ways to shape the data. They can slice and dice it however their end users would like. However, the analyst now also maintains testing and execution schedules in the transformation tool's interface. That makes two separate calendars - two places to schedule, track, test, and alert. For now, the story seems positive for the analyst.

Chapter 3: Other Data Sources

Now, the end users have asked for additional metrics. The analyst realizes that the data behind those metrics comes from a few different external applications and sources, none of which are in the database. They adopt a new tool that ingests that data and schedules the cadence of the ingestion.

Things are no longer so simple. The analyst has to think through the data modeling, along with the downstream implications of ingesting, normalizing, and cleaning these new rows of data. Even though that ends up a success in this story, there are now three calendars of scheduling and alerting to manage, and it's becoming tiresome to hunt down the root cause whenever an error occurs.
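To see the sprawl concretely, here is a hypothetical inventory of the analyst's scattered schedules, written down in one place (the tool labels, job names, and cron expressions are all made up for illustration):

```python
# Hypothetical inventory of where scheduling now lives. Nothing enforces that
# ingestion finishes before transformation, or that transformation finishes
# before the dashboard refresh - each tool fires on its own clock.
SCHEDULES = {
    "ingestion_tool":      {"job": "sync_external_sources", "cron": "0 * * * *"},
    "transformation_tool": {"job": "rebuild_models",        "cron": "15 * * * *"},
    "bi_tool":             {"job": "refresh_extracts",      "cron": "30 * * * *"},
}

for tool, config in SCHEDULES.items():
    print(f"{tool}: '{config['job']}' runs on cron '{config['cron']}'")
# If ingestion runs long one hour, everything downstream silently works off
# stale data, and debugging means logging into three different tools.
```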

Chapter 4: Machine Learning / Predictive Modeling

Unfortunately for the analyst, the story doesn't end at tiresome. The end users want to incorporate predictive analytics into the dashboards. Building those models is not within the analyst's role, but they still need the output of the predictive models to present to their stakeholders.

So, the analyst interfaces with the data science team. However, the data scientists work off an entirely different database that the analyst is just now learning about - essentially a black box given the current setup. Nevertheless, the analyst figures out what data the data science team will need and sets up yet another cadence to move the data from one data platform to the other.

Some questions arise: How do I get this predictive data set back into my data platform? Does it also need transformation before it can be visualized in the dashboard? How do I know when the predictive data gets refreshed? Should I know more about this whole process than I do? Am I just someone who manages calendars and schedules and hunts down the cause of errors? If these questions feel relatable, I understand. The analyst in this story is loosely based on real stories I've both heard and lived through.

To top it all off, the consumers of the dashboard are starting to question the quality of the data in the first place! Isn't providing reliable insights the analyst's MAIN job? If only there were a way to manage all of this from one centralized place.

Epilogue

And that, friends, is the why. Why does data orchestration exist? To solve this very real problem and its many variations. Data orchestration was created to fix the pain of not having a centralized control tower over data pipelines. In essence, we created Shipyard to help Data Engineers, Data Scientists, Analytics Engineers, and Data Analysts focus on the part of the job they're best at: solving problems with data.
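To make "centralized control tower" concrete, here is a deliberately generic sketch of what any orchestrator does at its core: run the steps in order, from one place, and alert on failure. The step functions are hypothetical stand-ins for the tools in the story, and this is not Shipyard's actual API; a real orchestrator adds scheduling, retries, logging, and a UI on top.

```python
# Hypothetical stand-ins for the tools in the story.
def ingest_sources():
    print("pulling data from external applications...")

def run_transformations():
    print("rebuilding models and running data tests...")

def sync_to_data_science():
    print("shipping data to the data science platform...")

def refresh_dashboard():
    print("refreshing dashboard extracts...")

PIPELINE = [ingest_sources, run_transformations, sync_to_data_science, refresh_dashboard]

def run_pipeline():
    """One schedule, one ordering, one place to look when something breaks."""
    for step in PIPELINE:
        try:
            step()
        except Exception as exc:
            # One central alert instead of a separate alerting setup per tool.
            print(f"ALERT: {step.__name__} failed: {exc}")
            raise  # stop downstream steps from running on bad or stale data

run_pipeline()
```

One schedule replaces the three calendars from the story, and the ordering guarantees that downstream steps never run against half-loaded data.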

Benefits

  • It saves time
  • It decreases the complexity of juggling schedules, tests, and alerts across multiple tools
  • It increases visibility for more members of the data team
  • It increases data quality and trust
  • Did I mention that it saves time?

I like analogies. If you’ve stayed up to date on the social media world, you know it is fragmented - LinkedIn, Twitter, Threads, Reddit, Mastodon, and Bluesky, to name a few. Wouldn’t it be nice to have one mega platform to browse all the content from these places at once? That’s what data orchestration is for your data pipeline: see all the “feeds” from your data in one place.


Conclusion

Now that you understand why data orchestration exists, stay tuned for the rest of the six-part series on simplifying data orchestration, where we'll dig into more of the important discovery questions. In the interim, check out our Substack, where our internal team curates articles weekly from all across the data space.

Ready? Get started with our free Developer Plan now.