DAGs in Data Engineering: A Powerful, Problematic Tool
The use of directed acyclic graphs (DAGs) in modern data engineering is an interesting topic, because in the right situations, DAGs can serve as powerful tools for mapping out workflows and data pipelines.
However, in data orchestration and management, a tool can only be as powerful as it is productive. And in certain data environments, the cons of DAGs can easily outweigh their pros.
To understand why, let’s break down what a DAG technically is, review the pros and cons, and dig into why DAGs aren’t a one-size-fits-all solution in modern data environments.
What is a DAG?
Originating in mathematics, a directed acyclic graph (DAG) is a graph whose connections each point in one direction and never form a loop, which makes it a natural way to model an order-based relationship between events. This is why DAGs are useful in data engineering, DevOps, and data science as an efficient way of mapping out (i.e., graphing) workflows and data pipelines.
At their most basic, DAGs consist of just two elements: points and lines.
- Points, referred to as nodes, represent individual tasks or steps in a process (i.e., data flow).
- Lines acting as pointers represent the directionality in directed acyclic graphs, identifying which tasks and steps need to occur in which order.
As acyclic graphs, DAGs contain no cycles or loops. You cannot start at one node and follow a sequence of steps that brings you back to that same node. This is another aspect that makes them useful in data engineering to ensure a given process functions as intended. In practice, they can be used for everything from basic ETL (extract, transform, load) and data build tool (dbt) processes to advanced machine learning projects.
Helpful note: In mathematics and graph theory, nodes can alternately be referred to as “vertices” and lines as “edges.” The terms are effectively synonymous. But with respect to data engineering and orchestration, we prefer the use of nodes/lines due to how they relate to both workflows and ETL pipelines. 👍
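To make the nodes-and-lines idea concrete, here is a minimal sketch using `graphlib` from Python's standard library (Python 3.9+). The pipeline and its task names are hypothetical and only illustrate the concept: each key is a node, and each set lists the nodes that must finish before it.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL pipeline: each key is a task (node); its set holds the
# tasks that must finish first (the incoming lines, or edges).
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_tables"},
}

# static_order() yields every task only after all of its prerequisites.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

The two extract tasks may appear in either order, since neither depends on the other, but `join_tables` always follows both and `load_warehouse` always comes last.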
Tools used to actually implement and use DAGs in data environments differ depending on the specific requirements of a given project. Variables used to determine what tool or software is required for implementation include the scale of data involved in the project, the complexity of related workflows, the programming environment, and any integration needs with other systems.
For instance, if the goal is to help automate and schedule complex data pipelines with dependencies, data teams might implement DAGs using Apache Airflow. This is because Airflow enables dynamic pipeline generation, excels at handling complex dependencies between tasks, and allows users to define workflows as DAGs using Python, a language many data professionals are familiar with.
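As a rough illustration of what "workflows as Python code" can look like, the sketch below imitates the `>>` style Airflow uses to declare that one task runs before another. This is not Airflow's API: the `Task` class and `run` function are hypothetical stand-ins built on the standard library so that the example runs without an Airflow installation.

```python
from graphlib import TopologicalSorter


class Task:
    """Hypothetical stand-in for an orchestrator task (not Airflow's class)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` records that b depends on a, then returns b so that
        # chains such as a >> b >> c read left to right.
        other.upstream.add(self)
        return other


def run(tasks):
    """Run each task's function after all of its upstream tasks."""
    graph = {task: task.upstream for task in tasks}
    executed = []
    for task in TopologicalSorter(graph).static_order():
        task.func()
        executed.append(task.name)
    return executed


extract = Task("extract", lambda: print("extracting"))
transform = Task("transform", lambda: print("transforming"))
load = Task("load", lambda: print("loading"))

# Declare the dependencies: extract, then transform, then load.
extract >> transform >> load

executed = run([extract, transform, load])
```

A real orchestrator adds scheduling, retries, and logging on top of this idea, but the core is the same: tasks are ordinary Python objects, and the dependencies between them form the DAG.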
Alternatively, the same team might use Apache Spark to implement DAGs if the project instead involved building data processing pipelines for big data applications. Spark is designed for big data processing, supports multiple languages in addition to Python, and comes with powerful libraries like Spark SQL for data querying, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time data processing.
The pros and cons of DAGs in data engineering
Clearly, then, the unique experiences and skill sets of the data engineering team will determine how DAGs are implemented. However, the pros and cons inherent in directed acyclic graphs themselves will most likely determine if they are to be used in the first place.
Pros
- Clear dependency management: As discussed, DAGs produce intuitive visual representations of task dependencies. These visualizations help data engineers understand the functionality of a given workflow or pipeline as a part of the overall data lifecycle, ensuring tasks will be executed in the correct order.
- Avoidance of cycles: Acyclicity rules out circular dependencies, where tasks wait on each other indefinitely and the workflow can never complete.
- Parallel execution: Data engineers can more easily identify tasks in a workflow that can be executed in parallel. Doing so can improve workflow efficiency while reducing overall execution time.
- Reproducibility: A DAG fixes the order in which dependent tasks run, which helps make the results of a workflow reproducible. That is crucial for ensuring consistency and reliability.
- Ease of monitoring and debugging: The explicit structure of a DAG makes workflows easier to monitor and simplifies debugging, since an issue can be traced to a specific task and the tasks upstream of it.
- Scalability: DAGs are well-suited for distributed computing environments, as they allow for scaling workflows and data pipelines across multiple machines or clusters.
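The parallel execution point can be sketched in a few lines. Assuming a hypothetical pipeline described as a dictionary of task dependencies, the standard library's `graphlib.TopologicalSorter` can group tasks into "waves," where every task in a wave is free to run at the same time. The `parallel_waves` helper is our own illustrative name, not part of any orchestration tool.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "clean_customers": {"extract_customers"},
    "join_tables": {"clean_orders", "clean_customers"},
    "load_warehouse": {"join_tables"},
}


def parallel_waves(graph):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished
    return waves


waves = parallel_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the six tasks collapse into four waves: both extracts, both cleaning steps, the join, and the load. A scheduler could hand each wave to a pool of workers instead of running the tasks one by one.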
Cons
- DAGs come with a learning curve: Data engineers, especially beginners, should be ready to invest time in learning to use workflow management tools that are either DAG-based or that specifically offer DAG creation functionality. What’s more, those without a data engineering background may not be able to work with DAGs at all due to their complexity.
- Complexity in large workflows: When faced with large and complex workflows, a DAG can become difficult to effectively manage and visualize. In such situations, it can counterintuitively make understanding and modifying the workflow it represents more complicated.
- Limited flexibility for dynamic workflows: DAGs are usually defined before a run starts, which makes them fairly static. Therefore, workflows requiring dynamic changes or conditional branching based on runtime data will be harder to map out with them.
- Overhead in simple cases: For very simple workflows, the overhead required to set up a DAG-based system may very well be overkill. In these cases, simpler approaches might involve writing simple scripts in a language like Python, Bash, or PowerShell to automate straightforward tasks. Alternatively, ETL tools may also provide graphical interfaces that can assist in designing pipelines that can handle simple to moderately complex workflows.
- Potential for underutilization: When designed carelessly, DAGs can lead to underutilized resources, especially in use cases that require tasks running in parallel.
- Dependency complexity: Use cases that involve many interdependent tasks can be difficult to manage and optimize with DAGs. Or, at least, difficult to manage without the DAG itself becoming an issue.
How to determine if DAGs are right for your own data environment
There are several key questions that data engineers can ask to evaluate whether directed acyclic graphs (DAGs) would be beneficial (or potentially more trouble than they’re worth). Here's a structured way to approach this assessment:
1. Cyclic dependencies
- Do our workflows involve cyclic dependencies or iterative processes that a DAG cannot represent?
Right out of the gate, any need for cyclic dependencies will remove DAGs from the operational equation, as they cannot represent cycles.
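One quick way to answer this question for an existing workflow is to write its dependencies down and let a library check them. The sketch below uses `graphlib` from Python's standard library; the two workflows and the `find_cycle` helper are hypothetical examples rather than part of any orchestration tool.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(graph):
    """Return the tasks forming a cycle, or None if the graph is a valid DAG."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as error:
        # The second element of args holds the detected cycle as a list
        # of nodes, with the first and last node being the same.
        return error.args[1]
    return None


# A plain ETL chain: no task ever leads back to itself.
valid = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

# Hypothetical iterative workflow: "retrain" feeds back into "evaluate".
iterative = {
    "train": set(),
    "evaluate": {"train", "retrain"},
    "retrain": {"evaluate"},
}

print(find_cycle(valid))
print(find_cycle(iterative))
```

If `find_cycle` returns a list, the workflow cannot be expressed as a single DAG as written. The loop would need to be unrolled into a fixed number of steps or handled inside one task.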
2. Tooling and integration requirements
- What tools are we currently using, and do they support DAG-based workflows?
- Do we need integration with other systems or technologies that might influence the choice of workflow structure?
A lack of DAG support in existing tools, or a system simple enough not to need them, may be all a team needs to know to pursue other solutions. Alternatively, in a data environment that supports DAGs and integrates with other systems, DAGs may be worth any drawbacks they carry.
3. Complexity of workflows
- Are the data workflows in our organization complex with multiple stages and dependencies?
- Do tasks need to be executed in a specific order?
If you are or will be dealing with complex, ordered workflows, DAGs may help you manage the multiple stages and dependencies they often entail.
However, they may be overly complicated for simple workflows or tasks that don’t have a strict execution order, like concurrent data transformations, or when simultaneously extracting data from multiple APIs or data warehouses where one extraction does not impact or rely on another.
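For the independent-extraction case, a thread pool from Python's standard library may be all that is needed. In this minimal sketch, the source names are hypothetical and `extract` is a placeholder that returns fake rows instead of calling a real API or warehouse.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(source):
    """Placeholder for a real API or warehouse call; returns fake rows."""
    return [f"{source}-row-{i}" for i in range(3)]


sources = ["crm_api", "billing_api", "events_warehouse"]  # hypothetical names

# No extraction depends on another, so there is no order to manage:
# a thread pool can simply run them all at once.
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    # pool.map returns results in the same order as the inputs.
    results = dict(zip(sources, pool.map(extract, sources)))

for source, rows in results.items():
    print(source, len(rows))
```

Because nothing here depends on anything else, a dependency graph would add setup work without changing how the tasks run.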
4. Frequency of workflow changes
- How often do our data processing workflows change?
- Are these changes predictable or do they happen on an ad-hoc basis?
For data environments subject to frequent, unpredictable workflow changes, the rigidity of DAGs could become an issue, as they would require frequent updates. In mature industries like finance, insurance, and manufacturing, this may be a non-issue since regulations and established practices lead to stable, predictable workflows.
5. Scale of data operations
- What is the volume of data we’re dealing with?
- Do we require scalability to handle large-scale data processing?
It can be tough to justify the overhead of DAGs for smaller-scale, low-volume operations, since that overhead exists to deliver the scalability and efficiency needed to manage large and complex data flows.
6. Real-time processing needs
- Is there a need for real-time data processing with low latency?
- How critical is immediate data processing and responsiveness for our operations?
Similar to scale, the overhead that directed acyclic graphs introduce can be prohibitive for data architectures designed to handle real-time, low-latency processing. Examples of such environments include the Internet of Things (IoT), online gaming, content streaming services, and utilities.
7. Your team's expertise and resources
- Does our team have the expertise to implement and maintain DAG-based workflows?
- What resources (time, budget, personnel) are available for the implementation and ongoing management of DAGs?
One major sticking point for using directed acyclic graphs can be if a team lacks expertise and/or resources. With seasoned engineers who have enough time and budget available, the implementation and maintenance of DAGs can be well worth it. Implementation is still possible in the opposite case, but it will be challenging and may require resources that simply aren’t there.
8. Nature of data sources and end goals
- What are the sources of our data, and how do they interact with our processing system?
- What are the end goals of our data processing workflows (e.g., reporting, analytics, operational processes)?
DAGs can also be worth the worry when a data infrastructure is designed to rely upon diverse data sources in order to achieve complex end goals, like machine learning workflows and advanced data pipelines for Business Intelligence (BI). Looking at simple sources and straightforward goals? Simpler methods will probably be more efficient.
9. Need for fault tolerance and recovery
- How critical is fault tolerance in our data processing?
- Do we need efficient ways to recover from failures or rerun specific parts of our workflows?
Directed acyclic graphs are excellent in environments like banking and healthcare, where fault tolerance and efficient recovery are crucial aspects of business operations. However, when fault tolerance isn’t a priority, the pros of DAGs might not outweigh their complexity.
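Part of what makes recovery efficient is that the graph tells you exactly what a failure affects. As a minimal sketch (the pipeline and the `tasks_to_rerun` helper are hypothetical), the code below finds the failed task plus everything downstream of it, so that only those tasks are rerun.

```python
from collections import deque

# Hypothetical pipeline: each task maps to the tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "build_report": {"join_tables"},
    "archive_raw": {"extract_orders"},
}


def tasks_to_rerun(graph, failed):
    """Return the failed task plus every task downstream of it."""
    # Invert the graph: task -> tasks that depend on it directly.
    downstream = {task: set() for task in graph}
    for task, upstream in graph.items():
        for dependency in upstream:
            downstream[dependency].add(task)

    # Breadth-first walk from the failed task through its dependents.
    to_rerun = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in downstream[current]:
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun


rerun = tasks_to_rerun(pipeline, "extract_customers")
print(sorted(rerun))
```

If `extract_customers` fails, only it, `join_tables`, and `build_report` need to run again. `extract_orders` and `archive_raw` are untouched, which is the kind of partial rerun the second question above is asking about.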
10. Long-term maintenance and evolution
- How do we anticipate our data environment evolving over time?
- Can a DAG-based system adapt to future needs and changes in our organization?
Finally, in static data environments that don’t expect much growth, the capabilities of DAGs (along with their overhead and complexity) may not be required at all.
What to do if DAGs aren’t for you
To crib respectfully from the late Douglas Adams, if you determine that DAGs aren’t a good fit for your data environment, DON’T PANIC. There are many other strategies available to implement organizational capabilities similar to those that DAGs provide.
Consider the following to understand how much is possible using a single, best-in-class data orchestration tool like Shipyard:
Workflow orchestration
For teams facing constraints in using traditional DAG-based tools, cloud-based platforms like Shipyard provide comparable workflow orchestration capabilities—making it easy to manage and automate data workflows.
Flexibility and customization
Orchestration tools like Shipyard also allow for a high degree of customization and flexibility in workflow design. This can be of particular use for teams needing specific solutions that align with their unique data environment constraints.
Integration capabilities
Shipyard also supports integration with various data tools and systems. This can be beneficial for teams needing to connect different parts of their data infrastructure seamlessly.
Ease of use and rapid deployment
Many teams need quick and efficient workflow management solutions that don’t come with a steep learning curve and aren’t overly complex to use. Tools like Shipyard are designed from the ground up to be user-friendly in order to enable rapid deployment of data workflows.
Scalability
As a cloud-based solution, Shipyard offers scalability, which is crucial for data engineering teams dealing with varying volumes of data and processing needs.
To DAG or not to DAG (that is the question)
In summary, knowing whether you should DAG or not DAG is similar to knowing if you’re in need of a quarter-inch drill bit or a quarter-inch hole. There are many ways to get the functionality required to build workflows and data pipelines that are flexible, customizable, and can scale on a dime. But not all of those ways can be explored for free.
Fortunately, Shipyard can be. As you’ll see, our data operations platform has data orchestration capabilities like Airflow, but 100x faster and just as powerful. Move your data at superhuman speeds with the power of your code and low code combined.
Sign up today to transform and refine your datasets into workflows in 10 minutes or less, reducing downtime and speeding up business processes—all without needing a credit card.