This is the final article in a six-part series on ‘Simplifying Data Orchestration.’ Expertise is not demonstrated by embracing complexity, but by the ability to take a complex topic and break it down for broader audiences.
Introduction
In the first article in this series, I mentioned the discovery questions. What a journey it has been since then; we’re almost to the finish! This article completes the six-question collection: the who, what, where, when, why, and how of data orchestration.
Specifically, this six-part series addresses who does data orchestration, what is data orchestration, when does data orchestration happen, where does data orchestration happen, why does data orchestration exist, and how do you do data orchestration.
Before now, it was hard to find these answers when evaluating a new product, tool, or way of doing things. More importantly, it was hard to find answers to these questions that almost anyone could understand.
I’ve enjoyed writing this series, and it’s bittersweet that we are on the final chapter. But, without further ado, it’s time to answer the final question. When does data orchestration happen?
The two types of when
When is a tricky question to answer. It depends on what type of when you are referring to. There are two ways I look at the when question: situational and functional.
Situational
When should your organization or team use data orchestration? If you’ve been following along with this series, you may notice that the situational when has strong ties to an earlier question, ‘why does data orchestration exist?’ As a refresher, data orchestration exists to solve the pain of managing your data pipeline from disparate sources. Do you have that type of pain? That’s when you should consider data orchestration.
Functional
When do you want to run your data pipeline? Much like a conductor controls the beat of the song an orchestra is playing, a data orchestrator controls the beat of your data. One of the core elements of data orchestration is setting up your timing and cadence via scheduling. If you want more detail on how you can set up scheduling, check out “How do you do Data Orchestration?” The problem is that this only answers ‘how do you set up the when?’ The question I’m after here is ‘what should the when actually be?’
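Many orchestrators express that cadence with cron-style schedule expressions. As a rough, tool-agnostic sketch (the exact syntax and the example cadences below are illustrative, not tied to any specific product):

```python
# Cron-style schedule strings: minute, hour, day-of-month, month, day-of-week.
# These cadences are generic examples of the 'functional when'.
SCHEDULES = {
    "every_hour":     "0 * * * *",   # top of every hour
    "nightly":        "0 2 * * *",   # 2:00 AM every day
    "monday_morning": "0 6 * * 1",   # 6:00 AM every Monday
}

def is_valid_cron(expression):
    """Very loose sanity check: a standard cron expression has five fields."""
    return len(expression.split()) == 5
```

The point is not the syntax; it's that each pipeline gets an explicit, deliberate answer to "when does this run?"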
The Functional When Spectrum
To know what to set up when, you need to do some investigation. Pipelines have endpoints (let’s call them Z points) that are the outcomes of a successful run of the processes in your pipeline. How often do these Z points need to be created, refreshed, updated, etc.? This is heavily dependent on organizational needs and cost considerations. Examples of organizational needs are “we need this dashboard to be updated every Monday morning” or “every time a purchase is made from our website, we want to refresh our inventory data.” A cost-savings example would be scheduling an ETL process that moves data from your application to your warehouse daily instead of hourly. How live do you need your data to be?
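One way to make “how live” concrete is to give each Z point a maximum acceptable staleness and check against it. A minimal sketch (the names, dates, and thresholds here are hypothetical):

```python
from datetime import datetime, timedelta

def needs_refresh(last_refreshed, max_staleness, now):
    """Return True if an endpoint ('Z point') has exceeded its allowed staleness."""
    return now - last_refreshed > max_staleness

# A weekly dashboard tolerates up to a week of staleness;
# inventory data tied to live purchases would tolerate far less.
now = datetime(2023, 1, 9, 8, 0)  # a Monday at 8:00 AM
fresh = needs_refresh(datetime(2023, 1, 9, 6, 0), timedelta(days=7), now)  # refreshed this morning
stale = needs_refresh(datetime(2023, 1, 1, 6, 0), timedelta(days=7), now)  # over a week old
```

Writing the staleness budget down per endpoint makes the scheduling decision mechanical instead of a guess.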
Why not run your data orchestration pipeline as often as possible to make sure you always have the latest endpoints? Depending on what tools are in that pipeline, it could get very expensive very fast. This is why it’s important to understand the organizational needs and use of the endpoints. If a process only needs to run once a month to satisfy the organizational needs, then you shouldn’t be running it daily. Finding your place on the functional when spectrum is about balancing needs and costs.
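As a back-of-the-envelope illustration of that tradeoff (the per-run cost here is made up):

```python
def monthly_run_cost(runs_per_day, cost_per_run, days=30):
    """Rough monthly cost of running a pipeline at a fixed cadence."""
    return runs_per_day * days * cost_per_run

# At a hypothetical $0.50 per run, hourly is 24x the cost of daily
# for the exact same pipeline and the exact same endpoints.
hourly = monthly_run_cost(24, 0.50)  # 360.0 per month
daily  = monthly_run_cost(1, 0.50)   # 15.0 per month
```

If nobody looks at the endpoint more than once a day, that 24x premium buys nothing.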
Water vs Molasses
You don’t always have to worry about how long a pipeline takes to execute. Small data and simple logic usually mean a fast pipeline. But the bigger the data and the more complex the logic inside the tools in the pipeline, the longer it is going to take. Under those conditions, you should be aware of execution duration. Some pipelines execute like rushing water, and others trudge along like molasses. Imagine you’re working with billions of rows of data and have complex transformation logic to work through in order to populate reports. It could take an hour or more just to execute this process. Pro tip: don’t schedule a process that takes an hour to execute to run exactly the hour before it’s needed. Give yourself some wiggle room.
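Working backwards from the deadline makes the wiggle-room rule concrete. A minimal sketch (the 30-minute buffer is an arbitrary choice; size it to how variable your runtimes are):

```python
from datetime import datetime, timedelta

def latest_safe_start(deadline, expected_runtime, buffer=timedelta(minutes=30)):
    """Latest time to trigger a pipeline so it finishes before the deadline,
    leaving a buffer for slower-than-usual runs."""
    return deadline - expected_runtime - buffer

# Report needed Monday at 9:00 AM; the pipeline usually takes about an hour.
deadline = datetime(2023, 1, 9, 9, 0)
start = latest_safe_start(deadline, timedelta(hours=1))  # 7:30 AM, not 8:00
```

Scheduling at 8:00 would leave zero slack; one slow run and the report is late.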
As technology improves, processing times will only continue to decrease. Some of a pipeline’s speed comes from the technology itself, but it also depends on how well your queries are optimized. This is starting to get into the weeds a bit, but it’s important to understand that speed is a product of both the tool and how well you’re using it. If you want to read more about query planners and compilers for query optimization, I’d recommend this page to start.
Monitoring Execution Time in Shipyard
Some pipelines will take longer to execute than others, but how do you check the timing? Using our own tool as an example, let’s take a look at the fleet log.
This fleet is triggered to run on a schedule, which you can see in the top left-hand corner. Each of the green bars represents a vessel that has successfully run. When you hover over a bar, you can see the runtime for its respective vessel. Shipyard doesn’t charge you for the entire time the pipeline takes to run; there's a difference between billable runtime and duration. You're only charged for the time it takes the scripts to run on our infrastructure.
Summary
To know when to do data orchestration, you need to know two things first:
- How often are the end products of a pipeline needed?
- How long will the pipeline take to execute?
Once you have the answer to these questions, you can choose which triggers make the most sense for your use case.
That concludes the Simplifying Data Orchestration series! Until the next series starts, make sure to check out our Substack, where our internal team curates articles weekly from all across the data space.