How do you do Data Orchestration?
This is part three of a six-part series on ‘Simplifying Data Orchestration.’ Expertise is not found in complexity, but in the ability to take a complex topic and break it down for broader audiences.
Introduction
In the first two articles of this series, I answered ‘why does data orchestration exist?’ and ‘what is data orchestration?’ I mentioned the common discovery questions asked when evaluating a new tool, product, or way of doing things: who does data orchestration, what is it, where and when does it happen, why does it exist, and how is it done? What I’ve noticed is that, in data orchestration, it’s difficult to find concrete answers to these questions that anyone can understand. That is why this series was created. In this third installment, I’ll answer the “how” question.
The “how” question is arguably the most difficult one I’ve tackled thus far. There’s more than one question beginning with how. How do you do data orchestration? How does data orchestration work? How does data orchestration get implemented? Most articles focus on the first of those three questions. While it’s important to dive into the details and provide excellent documentation, a single article ends up narrowly focused on a particular feature or process. To cover everything, you have to either read the entire documentation or countless separate articles. That’s the difficulty of the “how” question: scope.
This article doesn't have the narrow scope of a tutorial or feature explanation, where you can follow along and do it yourself. Nor is it written so that only seasoned data engineers can understand it. Regardless of the technical demands of your data role, you should be able to answer that question after reading this. To make that possible, I can’t assume too much about your technical background.
Another layer of difficulty is that there are several different data orchestration tools. (In case you forgot, you’re reading this article on Shipyard’s blog, haha). While I’ve tried my best to remain tool agnostic within this series, answering the “how” question requires some deviation from that. The reason many different data orchestration tools exist is that they each do it slightly differently. Keep that in mind: while I can tell you how to do the different parts of data orchestration, each part looks different depending on what tool works best for you or your organization.
Now that I’m done rambling off all the context and disclaimers, let’s dive into the interesting parts, shall we?
How do you do data orchestration?
If you’ve read the first two articles of the series, you know that data orchestration is the control tower for your data pipeline. It's the central place to control your systems and processes. Without it, you can still have separate tools that “talk” to one another, but there’s no central point that lets you zoom out and see the whole pipeline from start to finish. Data orchestration isn't always doing something entirely new; often it's enhancing what you’ve already got by providing a higher level of visibility.
So, how does data orchestration enhance what you’ve already got? I’ll separate it into three different parts:
- Data movement
- Tool communication
- Issue alerting
First, let’s define a couple of Shipyard-specific terms: Vessels and Fleets. Nothing annoys me more than reading introductory material that doesn’t define key terms! For more of our wonderful nautically themed terms, see here.
Vessels:
A Vessel is the "unit of work" in Shipyard. It's an individual script designed and built to accomplish a specific job. Vessels can be built for any purpose and come in any size, shape, or speed.
Vessels run in independent Docker containers, so the underlying code, environment variables, and packages of one Vessel can’t affect another.
What makes Shipyard unique is its low-code options. Vessels can be built either with your own code or with an open-source, low-code Blueprint. These Blueprints contain the design specifications that determine what a Vessel will do, and the code that powers them is open source and available on our GitHub. The settings and options required when adding a new Vessel differ based on whether you're using code or a low-code Blueprint.
Shipyard offers hundreds of different Blueprints. You can also create your own custom, reusable Blueprint. Custom Blueprints give technical users a way to create a reusable piece of code with organization-specific logic. When using a Blueprint, users only need to provide a few key inputs; all code, package dependencies, and environment variables are abstracted away from the end user.
Fleets:
A Fleet is a collection of multiple Vessels working together to tackle one larger job. Fleets are the equivalent of a Workflow or DAG in other orchestration tools.
Fleets can contain any number and type of Vessels, including a mixture of programming languages and Blueprints. They can be built in either the Visual Editor or the YAML Editor. Changes made in one editor are synced to the other.
1. Data Movement
TL;DR: Move data from one Vessel to another in your pipeline using ephemeral local file storage. Get data from integrated tools using APIs.
Some Vessels generate or download files that are passed on to other Vessels in the Fleet. In other words, this is your data moving through the pipeline.
A Vessel can represent many different entities: a data platform, data storage, an ETL tool, a BI tool, an email system, and much more. Each of these entities can be set to perform a specific task. As data moves from one Vessel to another, it can change in size, shape, and format, depending on what each Vessel is set up to accomplish.
When a Fleet runs, it creates ephemeral local file storage. Files generated by upstream Vessels are available to every downstream Vessel, but only while the Fleet is running its process. If the same Fleet is run two or more times in parallel, each instance has its own separate file storage.
This is equivalent to running individual scripts one after another on your local computer. Creating two files with the same name in a Fleet will result in the more recent file overwriting the older one.
By default, all generated files are stored in the home directory, /home/shipyard/, and all scripts are run from this same directory. To access a generated file, you don't need to include the home directory. However, if you chose to create a file in a subdirectory, that subdirectory structure must be included to access the file. To see a list of all files available in a Fleet, follow this guide.
Once every Vessel in the Fleet has finished running, all files are immediately wiped from the Shipyard platform. This setup allows Vessels to run independent, modular tasks without uploading/downloading files to/from an external storage solution.
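To make that concrete, here's a minimal sketch of two hypothetical Python Vessels, one upstream and one downstream. The file name and data are made up; the point is simply that a file written by the first script is readable by the second, because both run against the Fleet's shared ephemeral storage.

```python
# upstream_vessel.py - a hypothetical Vessel that generates data and
# writes it to the Fleet's shared ephemeral storage.
import csv

orders = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": 42.50},
]

# Files written to the working directory (/home/shipyard/ by default)
# are visible to every downstream Vessel while the Fleet is running.
with open("orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
    writer.writeheader()
    writer.writerows(orders)
```

```python
# downstream_vessel.py - a hypothetical Vessel that picks up the file
# its upstream neighbor left behind and transforms it.
import csv

with open("orders.csv", newline="") as f:
    total = sum(float(row["amount"]) for row in csv.DictReader(f))

print(f"Total order amount: {total:.2f}")  # appears in the Vessel log
```

Once the Fleet finishes, both the CSV and anything else these scripts wrote are wiped, which is exactly the behavior described above.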
API (Application Programming Interface)
An API acts as a bridge between software applications so they can communicate and collaborate with each other. It facilitates data exchange and functionality sharing between various tools and platforms. APIs enable developers to leverage pre-built functionalities and create modular, maintainable systems. To get information from one tool or platform into another using an API, the client application sends a request to the server application. The server processes the request, retrieves the required data, and packages it in a structured format. This data is then transmitted back to the client application.
Shipyard integrates with many other tools by using API functionality built into the low-code Blueprints. To use the API calls in a Blueprint, you need an authentication method or access token; we have authorization guides for each of our integrations, and you can see an example of one here. Shipyard also has functions of its own that can be called via an API, which you can find in the Shipyard API documentation.
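To make the request/response pattern concrete, here's a minimal sketch in Python using the requests library. The URL, token, and response shape are placeholders, not any real integration's API; each tool's authorization guide gives you the real values.

```python
import requests

# Hypothetical endpoint and token -- substitute the real values from
# your tool's authorization guide.
API_URL = "https://api.example.com/v1/reports"
ACCESS_TOKEN = "your-access-token"

# The client sends a request; the server validates the token, retrieves
# the data, and sends it back in a structured format (here, JSON).
response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
response.raise_for_status()

for report in response.json().get("reports", []):
    print(report["name"])
```

A low-code Blueprint wraps exactly this kind of call, so the end user only supplies the token and a few inputs rather than writing the request themselves.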
2. Tool Communication
If you want to know what sets data orchestration apart from all the other categories of tools, it’s communication. Holistic communication facilitates the fundamental interconnectedness of all things. How does data from an app that asks users to rate how much they like potatoes relate to a Slack notification that your month-over-month revenue for parkas sold in Canada went down? I won’t pretend to understand the business of this hypothetical company, but the point is, it could be relevant. You need the holistic view. You need the whole picture.
Workflow Building
TL;DR: Build workflows as a Fleet to allow individual Vessels to communicate with one another. Use other resources like APIs and Webhooks to set the conditions that trigger the Fleet to run.
When an upstream Vessel finishes running, it returns a status code to indicate whether it was successful. The paths coming out of the Vessel are evaluated using AND logic: if all paths evaluate as true, Shipyard kicks off all connected downstream Vessels; if a path evaluates as false, all downstream Vessels are marked as incomplete. Note that success of the Vessel is not the same thing as success of the code within it. A Vessel can be successful even when the code in it fails. For example, you can set up a Vessel to look for new data from the last week in your source; if no new data is found, you can treat that as the “success” criteria that starts the next Vessel in your Fleet, as the sketch below shows.
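Here's a hedged sketch of that pattern in Python. The fetch_new_rows helper is hypothetical; the point is that the script decides what counts as success by choosing its exit code.

```python
import sys

def fetch_new_rows():
    """Hypothetical helper that queries your source for last week's data."""
    return []  # pretend the source had nothing new this week

try:
    rows = fetch_new_rows()
    if not rows:
        # "No new data" isn't a failure for this pipeline -- log it and
        # exit 0 so downstream Vessels still run.
        print("No new data found; continuing.")
        sys.exit(0)
    print(f"Processing {len(rows)} new rows...")
except Exception as err:
    # A genuine error exits non-zero, which marks the Vessel as failed
    # and evaluates its downstream paths as false.
    print(f"Vessel failed: {err}")
    sys.exit(1)
```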
There are several different path options in a workflow. If a Vessel is part of a converging path, it waits until all upstream Vessels have completed before beginning its task. If any of the converging paths evaluate as false, the downstream Vessel will not run. You can see some other path configurations in the image below.
Triggers and Webhooks
Triggers are the logic that determines when a Fleet should begin running. Shipyard currently supports four different types of triggers:
- Schedule triggers
- On demand triggers
- Webhook triggers
- API triggers
Scheduled and on demand triggers are fairly straightforward. With a scheduled trigger, you set the cadence for how often you want your Fleet to run. With an on demand trigger, you run the Fleet directly: on the top navigation pane of every Fleet, there is a Run Now button, and clicking it schedules the Fleet to run immediately.
API triggers execute a Fleet in Shipyard via a POST request from any service. API triggers can also dynamically pass environment variable overrides to a Vessel at runtime. This lets you build a Fleet that acts as a "shell" for your Vessels: the shell makes it possible to pass in different variables to change the behavior of each run.
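As a rough sketch of the pattern (the endpoint URL, auth header, and variable names below are placeholders; the real request format lives in the Shipyard API documentation):

```python
import requests

# Placeholder values -- substitute the real endpoint and API key from
# the Shipyard API documentation.
TRIGGER_URL = "https://api.example.com/fleets/<fleet-id>/run"
API_KEY = "your-shipyard-api-key"

# Environment variable overrides passed at runtime turn the Fleet into
# a reusable "shell": same Vessels, different behavior on each run.
payload = {"environment_variables": {"REGION": "ca", "REPORT_DATE": "2023-07-01"}}

response = requests.post(
    TRIGGER_URL,
    headers={"Authorization": API_KEY, "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print("Fleet run scheduled:", response.json())
```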
I realize this is starting to get into the technical weeds. If I’ve lost you, no worries. If not, I won’t leave you hanging: for a full walkthrough with a practical example, check out our Run Now with Custom Parameters post.
Last but not least are Webhook triggers. Webhooks allow real-time communication and data exchange between different applications or services. They enable one application to send automatic notifications or updates to another whenever a specific event occurs. The sending application (the source) monitors events or actions. When a particular event happens, such as a new order being placed, a payment being processed, or a user registering, the sending application initiates a POST request to the Webhook endpoint URL.
Webhook Triggers enable you to programmatically execute a Fleet in Shipyard by running a POST request from any service. When you run the POST request, your Fleet will be scheduled to run immediately. You can also use Webhooks to dynamically pass data to your Vessels at runtime.
Webhook triggers differ from API triggers in some important ways. Anyone who has access to the endpoint URL can use it; no authentication is required. Webhook triggers can accept any kind of data, while API triggers require JSON. Lastly, data received from the Webhook must be interpreted by custom code, so you can’t use it as an input to a Vessel built from the Blueprint library.
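Here's what the sending side might look like, as a minimal sketch. The endpoint URL and event payload are hypothetical stand-ins for the Webhook URL you'd copy from your Fleet's trigger settings.

```python
import requests

# Placeholder Webhook endpoint URL from the Fleet's trigger settings.
WEBHOOK_URL = "https://webhooks.example.com/fleet/<webhook-id>"

# A new-order event in the sending application fires the POST. Unlike
# the API trigger example above, no authentication header is needed --
# anyone with the URL can call it -- and the payload doesn't have to
# be JSON, though JSON is common.
event = {"event": "order.created", "order_id": 1042, "amount": 42.50}

response = requests.post(WEBHOOK_URL, json=event, timeout=30)
response.raise_for_status()
print("Fleet triggered by webhook event.")
```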
3. Issue Alerting
TL;DR: Set up conditions that let you know when your Fleet fails a run, or when there are other data quality issues.
Notifications
Notifications provide an automated way to receive updates about your Vessels and Fleets. Both successful and unsuccessful runs can trigger a notification.
In Shipyard, notifications are sent after the final status has been determined, not when an error first happens. This makes more sense in the context of guardrails. Guardrails let you set reasonable time limits between attempts and the number of attempts you want to make when running your Vessel or Fleet. For example, if your Vessel has guardrails set to 3x, an error notification won't be sent until it has errored out all three times.
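Guardrails are configured in the platform rather than in your code, but conceptually they behave like the retry loop below. flaky_task and the limits are hypothetical; notice that nothing is reported until the final attempt has failed.

```python
import time

MAX_ATTEMPTS = 3   # e.g. guardrails set to 3x
WAIT_BETWEEN = 60  # seconds between attempts

def flaky_task():
    """Hypothetical unit of work that sometimes fails."""
    raise ConnectionError("source temporarily unavailable")

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        flaky_task()
        break  # success: no notification needed
    except Exception as err:
        if attempt == MAX_ATTEMPTS:
            # Only now -- once the final status is determined --
            # would an error notification go out.
            print(f"Failed after {attempt} attempts: {err}")
        else:
            time.sleep(WAIT_BETWEEN)  # wait, then retry quietly
```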
Notifications are always a required field because someone should always know when a Fleet fails! That said, different teams need to be notified in different places. You can set up notifications to be delivered to email, Slack, Microsoft Teams, or PagerDuty. While Shipyard can't tell you exactly how to run your people processes, it's important that your data team sets a precedent for how it will respond to issues and alerts.
Error Handling
When you receive a notification that an error occurred, what happens next? Check the Fleet Logs! Every Fleet Log shows a list of each unique Vessel Log generated by the Fleet's voyage. By clicking on a Vessel ID, you can view more information.
The output section of an individual Vessel Log will show you:
- Environment variables that were set
- Commit hash of the code that was cloned (if using a Git Connection)
- Any data that your script printed to stdout
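Because anything written to stdout ends up in the Vessel Log, a few well-placed print statements make that output section far more useful when you're debugging. A small sketch:

```python
import os

# Anything printed to stdout shows up in the Vessel's log output, so
# print checkpoints and context you'll want during a post-mortem.
print(f"Starting sync; REGION={os.environ.get('REGION', 'not set')}")

rows_processed = 128  # stand-in for real work
print(f"Finished: {rows_processed} rows processed.")
```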
Version Control
Version control is a method of managing changes to a set of files over time. It tracks and records modifications made to Fleets, allowing developers to collaborate, manage code, and revert to previous states if needed.
In Shipyard, Fleet versions are represented by cards. On a version card, you will see a version number, the email of the version's creator, and when it was created. When you click on a card, you can see that Fleet version as YAML code.
Cards have different actions that you can take:
- Compare to latest
- Revert to this version
- Create new fleet from this version
Version control can be incredibly powerful when handling errors. In Shipyard, changes between versions are highlighted so you can see the differences between the last working version of your code and the current one.
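That highlighting works like a standard text diff between the two YAML versions. This isn't Shipyard's actual implementation, but Python's difflib shows the idea:

```python
import difflib

# Two hypothetical Fleet versions, as YAML text.
version_3 = """\
name: daily-sales
schedule: "0 6 * * *"
"""
version_4 = """\
name: daily-sales
schedule: "0 8 * * *"
"""

# A unified diff highlights exactly what changed between versions --
# here, the schedule moved from 6 AM to 8 AM.
for line in difflib.unified_diff(
    version_3.splitlines(), version_4.splitlines(),
    fromfile="version 3", tofile="version 4", lineterm="",
):
    print(line)
```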
Conclusion
Now that you understand the high points of how to do data orchestration, stay tuned for the rest of the six-part series on simplifying data orchestration, where we'll dig into even more of the important discovery questions. In the interim, check out our Substack, where our internal team curates articles weekly from all across the data space.