Big Data Pipelines: What They Are and How to Design One
Even if you’ve never seen Steven Spielberg’s original Jaws, chances are good you recognize one of AFI’s 100 greatest movie quotes of all time: “...you’re gonna need a bigger boat.”
In the scene the quote comes from, Martin Brody, a cynical, know-it-all chief of police, has just gotten his first (much too close) look at the movie’s titular monster. Only now does Brody appreciate the size of the problem he and his tiny crew are up against. Tragically, it’s too late to do much of anything about it.
Fortunately, the only sharks lurking in the oceans of big data are Hive-compatible SQL engines. For now. But cost and resource limitations are real issues for DevOps teams, especially for roles in data science and data analytics, and underinvestment in data transformation pipelines often isn’t apparent until a big issue comes out of nowhere and bites data engineers and data scientists in the butt. Just like Jaws, these potentially costly issues are much easier to prevent than they are to handle in the moment.
For this reason, it’s mission-critical for DevOps teams to understand how big data pipelines differ from regular pipelines, so they never end up realizing they “need a bigger boat.”
Understanding data pipelines by how they source data
It’s true, Jaws is, technically, a shark. But side-by-side, the differences between Jaws and the average selachimorpha are easy to see.
Gauging the difference between data pipelines and big data pipelines is a bit less obvious, however. It’s hard to directly compare the invisible formulations of instructions and equations that make up algorithms.
But, just like in biology, we can learn a lot about a creature by focusing on how it eats. So, let’s start by breaking down the underlying architecture of traditional data pipelines. Doing so makes it easier to appreciate those built specifically to devour big data for DevOps teams.
Basic pipeline architecture commonly breaks down into two methods of data sourcing: batch processing and streaming.
Batch processing works like the opportunistic feeders that swim around in our oceans, as data pipelines bite off one batch of source data at a time. As data ingestion takes place, it’s fully transformed into a dataset and delivered to specific data store providers (e.g., a data lake or data warehouse) before another batch is taken in.
Conversely, streaming pipelines function more like aquatic filter feeders. As opposed to transforming data batch-by-batch, transformation and delivery take place continuously, as new data flows into one end of the pipeline in real-time, then out the other.
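To make the contrast concrete, here’s a minimal PySpark sketch of the same transformation written both ways. The bucket paths, column names, and sinks are illustrative assumptions, not a prescribed setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sourcing-styles").getOrCreate()

# Batch: bite off one finite chunk of source data, transform it, deliver it, done.
batch_df = spark.read.json("s3://example-bucket/raw/2024-01-01/")
daily_counts = batch_df.groupBy("event_type").agg(F.count("*").alias("events"))
daily_counts.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_counts/")

# Streaming: the same aggregation runs continuously as new files land in the source path.
stream_df = spark.readStream.schema(batch_df.schema).json("s3://example-bucket/raw/")
running_counts = stream_df.groupBy("event_type").agg(F.count("*").alias("events"))
(running_counts.writeStream
    .outputMode("complete")      # emit the full, continuously updated result
    .format("memory")            # in-memory sink, just for illustration
    .queryName("running_counts")
    .start())
```

The batch job runs once and exits; the streaming query keeps running, updating its results as new data arrives.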
These two types of data sourcing and transformation each have pros and cons. Batch processing handles large amounts of data reasonably well, but that data needs to be known and finite before the run starts. Batch processing pipelines are also complicated to manage and expensive to build.
Stream processing, by contrast, delivers lower latency, meaning the streaming data it processes is available “in the moment,” which is perfect for time-sensitive applications. But as the demands and applications for real-time data grow (e.g., using machine learning for demand forecasting), stream processing can be easily overloaded, making it difficult to implement at scale.
How a hybrid approach helps big data pipelines source at scale
In the oceans, apex predators stay at the top of the food chain because they’re eating machines, and not limited to one specific source of feeding. And big data pipelines are similar, in that they aren’t limited to batch or stream processing. Thanks to Lambda architecture, they can do both.
As sourced data hits the big data pipeline, it's first ingested into the batch layer. Like the stomach of an oversized, ocean-stalking apex predator, this is where the data rests, waiting to be used by other parts of the pipeline.
The fact that the master dataset resides in the batch layer adds to the resiliency of big data pipelines, and resiliency is very important: if (or when) anything goes wrong in the other two layers, the dataset can be recovered or reconstructed as needed.
As data from various sources comes to rest in the batch layer, the serving layer provides a real-time view of the growing dataset. It’s this continuous batch-layer overwatch that enables interested parties (e.g., data engineers, data scientists, and data analysts) to access near-real-time information about the dataset as needed.
Remember, though, any master dataset on a big data diet is going to get really big, really fast. And it's for this reason that the speed layer separates the pipeline monsters from the guppies.
The speed layer executes incremental computations on all the additional datasets created by the batch and serving layers. This way, using specific streaming frameworks, the speed layer can provide and maintain low latency access to datasets of enormous size.
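Stripped of the heavy machinery, the way the three layers cooperate looks something like the toy sketch below. The dictionaries stand in for real batch and real-time views (which in practice would live in stores like HBase or Cassandra), and the names and numbers are purely illustrative.

```python
# Batch layer: precomputed view over the immutable master dataset
# (accurate, but only as fresh as the last batch run).
batch_view = {"user_42": 1250}

# Speed layer: incremental view covering only the data that arrived
# after the last batch run (fresh, but small).
realtime_view = {"user_42": 37}

def serve(key: str) -> int:
    """Serving layer: answer queries by merging the batch and speed views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(serve("user_42"))  # 1287 -> low-latency answer that still reflects the newest data
```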
How do you build a big data pipeline?
So, if you’ve decided you need your own big data Jaws, the next logical question is how to go about building one. The basic approach to building your own big data pipeline mirrors how the data itself flows through these pipelines.
1. Batch layer
First, set up your batch layer. Again, this layer will perform the initial processing of your source data and will save it all in the belly of your beast. It should use an Extract, Transform, and Load (ETL) process to convert raw data into structured data as it’s ingested so that it can be easily queried, and adding an indexing system on top of that structured data will help ensure your queries run faster.
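As a rough sketch of what that ETL step might look like in PySpark (the bucket paths, columns, and date-based partitioning scheme are assumptions standing in for whatever indexing approach you choose):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-layer-etl").getOrCreate()

# Extract: pull raw, semi-structured events into the pipeline.
raw = spark.read.json("s3://example-bucket/landing/events/")

# Transform: enforce types and derive the columns downstream queries need.
structured = (
    raw.withColumn("event_time", F.to_timestamp("event_time"))
       .withColumn("event_date", F.to_date("event_time"))
       .filter(F.col("user_id").isNotNull())
)

# Load: store the master dataset partitioned by date so later queries can skip
# irrelevant files instead of scanning everything.
(structured.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/master/events/"))
```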
2. Speed layer
Next comes the speed layer, which will handle the real-time (or near-real-time) processing of all the big data your pipeline will be consuming. Apache Storm and Apache Spark Streaming are two examples of tools you can use for this step. Ideally, use tools that provide high-throughput processing capabilities. In doing so, your pipeline will process data faster than it would if it relied on more traditional databases, like MySQL or PostgreSQL.
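Here’s one hedged sketch of a speed layer built on Spark Structured Streaming. The Kafka broker address, topic name, and event schema are assumptions for illustration, and it presumes the Spark Kafka connector is available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read events continuously from Kafka and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Maintain a near-real-time view: event counts per user over 5-minute windows.
counts = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "user_id")
          .count()
)

# Write incremental results out for the serving layer to pick up.
query = (
    counts.writeStream
          .outputMode("append")
          .format("parquet")
          .option("path", "s3://example-bucket/speed/user_counts/")
          .option("checkpointLocation", "s3://example-bucket/speed/_checkpoints/")
          .start()
)
```

The checkpoint location is what lets the streaming job pick up where it left off after a failure.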
3. Serving layer
With your speed layer set up, it’s time to add your serving layer, which provides the interface you’ll need to query data from the batch and speed layers (and produce aggregated results in response to your queries). Look for tools here that allow fast retrieval of the big datasets stored in their respective databases, such as Cassandra and HBase.
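A serving-layer lookup can then be as simple as querying one of those stores. The sketch below uses the DataStax cassandra-driver; the host, keyspace, table, and column names are hypothetical.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])
session = cluster.connect("analytics")  # hypothetical keyspace

# Fetch the precomputed aggregate for one user; the batch and speed layers
# are what keep this table up to date.
rows = session.execute(
    "SELECT user_id, purchase_total FROM user_totals WHERE user_id = %s",
    ("user_42",),
)
for row in rows:
    print(row.user_id, row.purchase_total)

cluster.shutdown()
```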
4. Optimize performance
Finally, make sure you optimize your pipeline’s performance. Because the biggest of the big data pipelines in the world isn’t much good if it can’t quickly and efficiently process the data it consumes. Optimization can involve setting up caching servers, as well as compressing any intermediate results the pipeline produces before they’re stored in the batch or speed layers.
Also, consider using distributed file systems like HDFS whenever possible. Compared to local file systems like NTFS or ext4, distributed file systems will provide higher throughput.
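Pulling those optimization ideas together in PySpark might look something like this sketch; the HDFS paths, column names, and Snappy compression choice are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("pipeline-tuning")
    # Compress intermediate Parquet output before it lands in the batch/speed stores.
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)

events = spark.read.parquet("hdfs:///data/master/events/")

# Cache an intermediate result that several downstream queries will reuse,
# so it isn't recomputed from scratch each time.
recent = events.filter(F.col("event_date") >= "2024-01-01").cache()

daily = recent.groupBy("event_date").count()
by_type = recent.groupBy("event_type").count()

# Writing to HDFS (rather than a local file system) gives the output path higher throughput.
daily.write.mode("overwrite").parquet("hdfs:///data/serving/daily_counts/")
by_type.write.mode("overwrite").parquet("hdfs:///data/serving/type_counts/")
```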
Taken together, these are the basics of building a pipeline that can handle any amount of data you might throw at it.
Continuing the hunt for your own big data pipeline
Spoiler alert: Jaws doesn’t end well. Especially for Jaws. (And one of Brody’s crew.) Despite needing a bigger boat, Brody survives to shark another day.
DevOps teams, however, need never face such a monstrous situation when taming a beast made up of vast volumes of data, multiple data types, evolving forms of data processing, and data sources both on- and off-premises. By investing in the right pipeline technology, they can steer their organization well away from danger while increasing reporting and analysis accuracy at the same time.
That’s why an increasing number of teams are trusting Shipyard to help in their data navigation. We build data pipelines and other data transformation tools with scalability in mind, so they can rise to meet challenges and use cases that are large, not so large, and everything in between.
So, schedule a demo of the Shipyard app or sign up for our Developer plan that’s free forever—no credit card required. Start to build data workflows in 10 minutes or less, automate them, and see how Shipyard fits your business.