Data pipelines work a lot like pipelines in the real world, transporting raw material—your data—from its origin to a centralized location where it can be used. They automate the process of combining and moving data so engineers can spend their time on more mission-critical tasks.
Moving data from one place to another sounds simple, but it often isn’t. Generally, the data has to be combined and transformed on its journey to ensure it can be received at its destination. And the entire process needs to be monitorable, so you can detect when issues arise.
Building a data pipeline that works well—and that will continue working well—requires thinking about architecture.
What is data pipeline architecture?
Data pipeline architecture describes how a pipeline is designed. Planning your pipeline architecture involves determining all data sources, desired destinations, and any tools that will be used along the way. You’ll also need to figure out what types of operations you’ll need to perform on the data, such as joins and transformations, so you can plan and structure the code you’ll use to make those transformations happen.
Important considerations for building a data pipeline
Data pipeline architecture should be considered carefully. While there are many ways to move data from a source to a destination, not every method of doing this will meet the requirements of your use case.
Every company is different, so it’s important to think through what your requirements are for the pipeline, and what they are likely to be in the future. Designing a pipeline that’s going to help your DataOps team by taking work off their plates requires forethought—a poorly designed pipeline can cause more problems than it solves.
There are typically four major areas of need to consider when architecting a data pipeline:
- Speed: How quickly does data need to move through the pipeline? If you’re sending data to an analytics database for report generation, for example, your speed requirements may be minimal. On the other hand, if your pipeline is feeding a real-time predictive machine-learning engine in your product, building a low-latency pipeline is crucial.
- Throughput: How much data does the pipeline need to move, and how often? Tools and code that can easily handle 10,000 rows a day may not be able to handle 10 million, so the amount of data you need to move will inform your pipeline architecture choices. (It helps to be forward-thinking and build a pipeline that can handle more throughput than your current use case requires. The amount of data that companies process rarely drops.)
- Reliability: How much time do you want to spend on the pipeline after it’s built? A data pipeline that’s frequently breaking, throwing errors, or delivering incomplete or corrupted data can be a tremendous time-sink for your engineers. Your pipeline architecture should include clear methods for monitoring, logging, and validating data before it’s delivered. This way, if (or when) things do go wrong, it’s easy to find out what happened and where along the pipeline it took place.
- Adaptability: How easy would it be to swap in a different tool or add a new data source? Business needs and tools change all the time. Your data pipeline architecture should be planned with this in mind. Building a more adaptable pipeline that can handle a wide variety of input types will make it easier to keep the pipeline running when, for example, the tools generating some of your marketing data (such as your CRM tool) change.
When preparing to design and build a pipeline, you’ll also want to consider where it will be deployed, as this may influence some of your design and tooling decisions. For example, will it be deployed to a cloud service such as Amazon’s AWS, or will it run on your own on-premises machines?
How to design and build a data pipeline
While every business has different needs, here’s a simple approach to designing and building a data pipeline. Note that when planning, it makes sense to think through the pipeline in the linear order we’ve outlined below. When you’re actually building the pipeline, however, a strictly linear workflow isn’t required, and multiple steps can be built simultaneously.
Step 1. Assess your data sources
What data sources, data sets, and tools will you be pulling data from? Compile a complete list of all sources and the specific data you’re interested in pulling from each. This might include raw data from databases and other data storage solutions, streaming data sources, and the tools and services your team uses day to day. It may involve working with both structured and unstructured data.
In this step, it’s also important to think about data structure and types of data. It may be helpful to create a data dictionary defining the specific information you’ll be pulling from each of your data sets and how it is formatted. Having this on hand will make building your transformation engine easier.
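A data dictionary doesn’t require special tooling; a structured file kept in version control next to your pipeline code is often enough to start. Here’s a minimal sketch in Python, where the source names, fields, and descriptions are hypothetical placeholders rather than a prescribed format:

```python
# A minimal, hypothetical data dictionary kept alongside the pipeline code.
# Source names, fields, and descriptions are illustrative placeholders.
DATA_DICTIONARY = {
    "crm_contacts": {               # source: CRM export (assumed)
        "customer_id": "string, unique identifier",
        "email": "string, may be null",
        "signup_date": "ISO-8601 date string",
    },
    "orders_db.orders": {           # source: relational database (assumed)
        "order_id": "integer, primary key",
        "customer_id": "string, foreign key to crm_contacts",
        "order_total": "decimal, USD",
        "created_at": "DATETIME in the source, TIMESTAMP at the destination",
    },
}
```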
Step 2. Create your ingestion engine
Once you know what data you need and where you’re pulling it from, it’s time to design the engine that pulls the data into the pipeline. Again, the specifics may vary—some data may be available via app APIs, while you may need to write scripts to export data from other sources.
At this stage, you’ll also need to settle on whether you want your pipeline to deliver streaming data—ingesting data in real time and delivering it as quickly as possible—or whether batch processing is better. This will depend on a variety of factors including your throughput requirements, the resources required to ingest data from your various data sources, and the plans for the data once it reaches its destination.
For example, if your pipeline is centralizing data in a data warehouse for analytics, batch processing is probably sufficient and will allow you to minimize the impact of your pipeline on other systems. But if your pipeline is pulling customer behavioral data to feed a real-time recommendation engine in your product, ingesting the data as soon as it hits your data source to then stream it right to the destination is likely the best course of action.
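As a rough illustration of the batch case, the sketch below pulls records from a hypothetical REST endpoint with the requests library and stages the raw response as a dated JSON file. The URL, token handling, and staging layout are assumptions; a real ingestion engine would add pagination, retries, and incremental loading.

```python
import json
from datetime import date
from pathlib import Path

import requests  # third-party HTTP client

API_URL = "https://example.com/api/v1/events"   # hypothetical endpoint
STAGING_DIR = Path("staging/events")            # assumed local staging area


def ingest_batch(api_token: str) -> Path:
    """Pull one batch of raw records and stage them as a dated JSON file."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly so monitoring can catch it

    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    out_path = STAGING_DIR / f"events_{date.today().isoformat()}.json"
    out_path.write_text(json.dumps(response.json()))
    return out_path
```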
Step 3. Define and build your data transformations
Once your data is in the pipeline, you’ll almost certainly have to do some amount of data integration and data processing to transform it. Exactly what’s required depends on the data you’re working with, but in general this will involve several of the following operations:
- Joining data from multiple sources into a single table (for example, joining customer data from different marketing tools to create a single table with a row for each customer)
- Transforming data to make data types and formats conform to the requirements of your final destination (for example, you may have a database that’s storing dates in DATETIME format, while the destination uses TIMESTAMP format)
- Aggregating data (for example, counting customer interactions from the data collected by various tools to aggregate a “total customer touches” datapoint for each customer)
- Filtering data (for example, removing incomplete, bad, or irrelevant rows or entries)
Needless to say, this step requires careful consideration of the data schema your final destination requires, particularly if it's an SQL database where that schema is rigidly enforced.
It’s worth noting that in an ETL pipeline, this step occurs before the data is delivered; in an ELT pipeline, the transformation happens after the data reaches its destination.
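To make those operations concrete, here’s a hedged sketch using pandas that strings together a filter, a type conversion, an aggregation, and a join. The column names (customer_id, occurred_at, touch_count) are hypothetical and would come from the data dictionary you built in step 1.

```python
import pandas as pd


def transform(customers: pd.DataFrame, interactions: pd.DataFrame) -> pd.DataFrame:
    """Join, convert, aggregate, and filter; column names are illustrative."""
    # Filter: drop rows with no customer identifier
    interactions = interactions.dropna(subset=["customer_id"])

    # Transform: normalize source DATETIME strings into proper timestamps
    interactions["occurred_at"] = pd.to_datetime(interactions["occurred_at"])

    # Aggregate: count interactions per customer ("total customer touches")
    touches = (
        interactions.groupby("customer_id")
        .size()
        .rename("touch_count")
        .reset_index()
    )

    # Join: one row per customer, enriched with the aggregate
    return customers.merge(touches, on="customer_id", how="left")
```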
Step 4. Deliver the data
Once the data is transformed, your pipeline must deliver it to its final destination. This is often a data warehouse, data lake, or another centralized big data store. However, a business might also use their data pipeline to feed data directly into an analytics tool or an application service. Where you’re delivering your data will depend on your use case, and how it’s delivered will depend on the particular destination or destinations you’ve chosen.
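If your destination is a SQL data warehouse, a minimal delivery step can lean on SQLAlchemy and pandas, as in the sketch below. The connection string and table name are assumptions, and high-volume pipelines would typically use the warehouse’s bulk-loading path instead.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection string; substitute your own driver and credentials.
engine = create_engine("postgresql+psycopg2://user:password@warehouse-host/analytics")


def deliver(df: pd.DataFrame) -> None:
    """Append the transformed rows to a destination table (name is illustrative)."""
    df.to_sql(
        "customer_touches",   # hypothetical destination table
        engine,
        if_exists="append",
        index=False,
    )
```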
Step 5. Set up monitoring
The final, critical step is to set up monitoring and ensure transparency for every stage and every element of the pipeline. This includes monitoring the software itself as well as any machines it runs on. If and when something breaks, you want to have a solid monitoring system that can both inform you that something has gone wrong and provide logs and error messages that point directly to the problem. Good monitoring will also allow you to optimize the pipeline, highlighting inefficiencies that your data engineering team can correct.
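A lightweight starting point is to have every stage emit structured logs and validate its output before handing data on. The sketch below uses only Python’s standard library; the specific checks are illustrative, and in practice you’d also route failures into whatever alerting system your team already uses.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def validate_batch(rows: list[dict], required_fields: set[str]) -> None:
    """Fail fast, with a loggable reason, if a batch looks wrong (illustrative checks)."""
    if not rows:
        logger.error("validation failed: batch is empty")
        raise ValueError("empty batch")

    missing = [r for r in rows if not required_fields.issubset(r)]
    if missing:
        logger.error("validation failed: %d rows missing required fields", len(missing))
        raise ValueError("rows missing required fields")

    logger.info("validation passed: %d rows", len(rows))
```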
Common architecture patterns for data pipelining
Data pipeline architectures can differ depending on the use case and the data and tools being used. However, here are a few typical architectures for common use cases:
Business intelligence data pipeline architecture
This kind of batch processing architecture is ideal for analytics use cases, where data is delivered to a data warehouse for later analysis by data scientists, for report and visualization generation, and more.
The diagram below depicts a classic ETL pipeline, where data is ingested from a variety of sources (including SaaS tools, relational databases, and NoSQL databases), transformed, and then delivered to a data warehouse.
Streaming data pipeline architecture
This diagram depicts an ETL pipeline that’s designed to deliver data to its destination in real time. This type of approach might be used to feed a recommendation engine, real-time analytics platform, or any other service that requires access to real-time data.
Note that here, we have two APIs that deliver streaming data directly to the ingestion engine, as well as a database and an application service that feed data into Apache Kafka, which then passes that data on to the ingestion engine. (We’ll touch on Kafka later in this article.)
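As a hedged sketch of the Kafka-to-ingestion hop shown here, the snippet below uses the kafka-python client to consume events as they arrive. The topic name, broker address, and message format are assumptions about your setup.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; adjust for your deployment.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value  # already deserialized into a dict
    # Hand each event to the ingestion engine / transformation step here.
    print(event)
```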
ELT pipeline architecture
If your destination of choice makes doing data transformation after delivery a viable option, an ELT pipeline architecture like this could work:
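In code terms, the essence of ELT is that raw data is loaded first and the transformation then runs inside the warehouse, typically as SQL. Here’s a minimal sketch, assuming a SQL warehouse reachable via SQLAlchemy and hypothetical table names:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical warehouse connection; in ELT, the heavy lifting happens in the warehouse.
engine = create_engine("postgresql+psycopg2://user:password@warehouse-host/analytics")


def load_raw(df: pd.DataFrame) -> None:
    """Load (the 'L') the untransformed rows straight into a raw staging table."""
    df.to_sql("raw_events", engine, if_exists="append", index=False)


def transform_in_warehouse() -> None:
    """Transform (the trailing 'T') with SQL executed inside the warehouse itself."""
    with engine.begin() as conn:
        conn.execute(text("""
            INSERT INTO customer_touches (customer_id, touch_count)
            SELECT customer_id, COUNT(*)
            FROM raw_events
            WHERE customer_id IS NOT NULL
            GROUP BY customer_id
        """))
```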
Data pipeline tools
While it’s certainly possible to build an entire pipeline from scratch, doing so can be time-intensive. Further, without consistent updates, the end result may be less adaptable than a pipeline built using third-party tools. Let’s take a quick look at some of the types of tools that are available to help build a data pipeline without overwhelming your DataOps team.
- Open-source libraries and frameworks: Libraries like Python’s pandas make working with data easier in your programming language of choice. Open-source tools like Apache Airflow are also useful for scheduling and monitoring the operations of your data pipeline (a minimal Airflow sketch follows this list).
- Event streaming tools: Tools like Apache Kafka simplify the process of extracting and delivering streaming data. For example, let's say you’ve got application services that are generating event data and you want to pass that data into a pipeline. Kafka can pull and transmit that data, providing a message stream that is reliable and proven at scale. Not everyone needs this kind of power, but if your pipeline requires moving huge volumes of data quickly and reliably, using a tool like Kafka will be easier than trying to build your own service to pull and stream your event data.
- Data orchestration tools: Tools like Shipyard provide plug-and-play solutions that make it easy to connect commonly-used services and move data between them. These low-code connections cover many of the most common data sources and destinations, and can be integrated with your own custom code for any sources that aren’t supported. This kind of tool can minimize the time required to develop a pipeline as well as the effort required for maintenance.
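To show how an orchestration or scheduling layer ties the steps above together, here is a minimal Apache Airflow sketch. The DAG name, schedule, and task functions are hypothetical, the helpers they would call are the ones sketched earlier in this article, and it assumes a recent Airflow 2.x release.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_ingest():
    # Call your ingestion helper here (e.g. the ingest_batch sketch from step 2).
    ...


def run_transform_and_deliver():
    # Call your transformation and delivery helpers here (steps 3 and 4).
    ...


with DAG(
    dag_id="daily_customer_pipeline",   # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",                  # assumes Airflow 2.4+ ('schedule_interval' on older releases)
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=run_ingest)
    transform_and_deliver = PythonOperator(
        task_id="transform_and_deliver",
        python_callable=run_transform_and_deliver,
    )

    ingest >> transform_and_deliver  # ingestion runs before transformation/delivery
```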
Build your own pipelines the easy way – give Shipyard a try today!