Building an ETL Data Pipeline Using Azure Data Factory
ETL pipelines play an expansive role in modern business and IT environments. Traditionally associated with data warehousing, these versatile parts of the data engineering toolkit are used to enable everything from business intelligence (BI) and master data management (MDM) to advanced analytics, big data, and machine learning initiatives.
And, as one of the most popular tools for ETL pipeline creation, Microsoft’s Azure Data Factory (ADF) plays a substantial role in getting all this work done. But what makes ADF so popular with the ETL-adjacent crowd? How exactly does an ETL pipeline function? What are folks who aren’t using Azure Data Factory doing to meet their own data pipeline needs?
All excellent questions.
Let’s start at the top:
What is Azure Data Factory?
Microsoft's Azure Data Factory (ADF) is a cloud-based data integration service that enables users to create, schedule, and manage data-driven workflows. These workflows, known as pipelines, are designed to move and transform data between supported data stores (e.g., data lakes and data warehouses).
While primarily used as an ETL (extract, transform, load) and ELT (extract, load, transform) tool, ADF can also integrate with other compute services in the Microsoft Azure ecosystem for data transformation tasks, such as Azure Databricks, Azure HDInsight, and Azure Synapse Analytics. This versatility makes it a formidable solution for data movement, transformation, and orchestration.
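For teams that prefer code to the portal UI, ADF resources can also be managed programmatically. Here's a minimal sketch using the azure-mgmt-datafactory Python SDK, assuming the subscription ID, resource group, and factory name below are placeholders you'd swap for real values:

```python
# Minimal sketch: provisioning a Data Factory with the Azure SDK for Python.
# Assumes azure-identity and azure-mgmt-datafactory are installed; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "rg-etl-demo"               # placeholder
factory_name = "adf-etl-demo"                # placeholder

# Authenticate with whatever credential is available (CLI login, managed identity, etc.)
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the Data Factory instance itself
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(f"Provisioned factory: {factory.name}")
```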
Anatomy of an Azure ETL data pipeline
In today's marketplace, more engineers are experienced with extract, transform, load implementations than with ELT, as ETL is the more mature data transformation method. Therefore, examining a hypothetical Azure ETL pipeline is a helpful window into the specifics of Microsoft's data integration service.
As noted, an Azure ETL pipeline functions within the greater Microsoft Azure ecosystem to source data, transform it into a desired format (or structure), and then load it into a target data store. When executed through ADF, here is how these three steps of the pipeline would typically shake out:
Data extraction:
- The ETL process begins as ADF retrieves data from sources like Azure SQL Database, Azure Cosmos DB, or a storage service like Azure Blob Storage.
- While being part of Microsoft's Azure ecosystem does impose some limitations on users, ADF also allows for extraction from external platforms and services.
- The extraction process in ADF often requires users to create linked services that connect to the targeted data stores, along with datasets that describe the data to be extracted (both sketched below).
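Continuing the hypothetical SDK setup above, this sketch registers a linked service and a dataset for the extraction step; the connection string, folder path, and resource names are all illustrative, not prescriptive:

```python
# Sketch: a linked service (how to reach the source store) and a dataset (what to extract).
# Reuses the hypothetical adf_client, resource_group, and factory_name from the earlier sketch.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService,
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

# Linked service: tells ADF how to connect to the source data store
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "SourceBlobLinkedService", blob_ls
)

# Dataset: describes the data to extract from that store
source_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="SourceBlobLinkedService"
        ),
        folder_path="raw/orders",   # placeholder container path
        file_name="orders.csv",     # placeholder file
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "SourceOrdersDataset", source_ds
)
```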
Data transformation:
- ADF will then transform extracted data into suitable formats for reporting or analysis.
- As with any ETL pipeline process, ADF transformation typically includes common functions like data cleaning and filtering, aggregation from various sources, and restructuring. However, some use cases require Azure's data transformation process to employ additional functionality, such as converting data types, applying business logic, or handling null values.
- Data transformation can occur using ADF's mapping data flows or via integrations with other Azure services, like Azure HDInsight for Hive and MapReduce jobs or Azure Databricks for Spark-based transformations (a sketch of the latter follows this list).
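To make the transformation step concrete, here's a rough sketch of the kind of Spark-based cleaning, filtering, and aggregation an Azure Databricks notebook might perform when invoked from an ADF pipeline; the storage paths and column names are invented for illustration:

```python
# Sketch of a Spark transformation a Databricks notebook might run when called from ADF.
# Paths, containers, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_transform").getOrCreate()

# Raw data staged by the pipeline's extraction step
raw = spark.read.option("header", True).csv(
    "abfss://raw@<account>.dfs.core.windows.net/orders/"
)

cleaned = (
    raw.dropna(subset=["order_id", "amount"])                 # data cleaning: drop incomplete rows
       .withColumn("amount", F.col("amount").cast("double"))  # data type conversion
       .filter(F.col("amount") > 0)                           # filtering out invalid amounts
)

# Aggregation and restructuring: daily revenue per region
daily_revenue = cleaned.groupBy("region", "order_date").agg(
    F.sum("amount").alias("total_revenue"),
    F.count("order_id").alias("order_count"),
)

# Write results where the load step (e.g., a downstream copy activity) can pick them up
daily_revenue.write.mode("overwrite").parquet(
    "abfss://curated@<account>.dfs.core.windows.net/daily_revenue/"
)
```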
Data load:
- ADF will then load transformed data into the user's target data store. Within the Microsoft Azure ecosystem, this could be Azure Data Lake Storage, Azure SQL Database, or Azure Synapse Analytics.
- Azure Data Factory typically uses copy activities to orchestrate this final phase of the ETL pipeline process (see the sketch after this list).
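Tying the steps together, the sketch below defines a copy activity, wraps it in a pipeline, and triggers a run. It again uses the hypothetical client from the earlier sketches, and the dataset names are placeholders you would have registered beforehand:

```python
# Sketch: a copy activity that loads staged data into the target store, wrapped in a
# pipeline and run on demand. Dataset names are placeholders.
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

copy_activity = CopyActivity(
    name="CopyCuratedToTarget",
    inputs=[DatasetReference(type="DatasetReference", reference_name="CuratedOrdersDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="TargetOrdersDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Wrap the activity in a pipeline and publish it to the factory
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    resource_group, factory_name, "LoadOrdersPipeline", pipeline
)

# Trigger a run and check its status
run = adf_client.pipelines.create_run(resource_group, factory_name, "LoadOrdersPipeline")
status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(f"Pipeline run {run.run_id}: {status.status}")
```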
Azure ETL pipeline distinctions to keep in mind
While the process of ETL orchestration using Azure Data Factory is very similar to that of other tools, ADF features several differentiators that users should take into account:
Cloud-native integration: By default, ADF is deeply integrated with the other services that comprise Microsoft's Azure ecosystem, so it connects seamlessly with other data engineering tools, like Azure SQL Data Warehouse (now Azure Synapse Analytics) and Azure Blob Storage. However, this can be a significant disadvantage for organizations that have yet to invest in the overall Azure ecosystem.
Management and monitoring: Integration with Azure's management and governance services, coupled with Azure Monitor, provides ADF users with robust management and monitoring capabilities.
Continuous integration and delivery (CI/CD): ADF's integration with Azure DevOps enables automated deployment and testing of the user's ETL pipelines.
Integration runtimes: Azure Data Factory offers users different types of integration runtimes. These include Azure, Azure-SSIS, and self-hosted, enabling hybrid ETL scenarios where data can be moved across Azure, multi-cloud, and on-premises environments.
Serverless compute: Teams that utilize Azure Data Factory benefit from cost savings and reduced overhead, since no infrastructure management is required.
Extensibility: Teams can extend ADF by using Azure Functions, which allows them to incorporate custom code and logic into their ETL processes (a minimal sketch follows this list).
Integration with Power BI: ADF also integrates seamlessly with Power BI, another Microsoft product, which affords teams access to easy reporting and analytics on their processed data.
Code-free ETL: Azure Data Factory emphasizes a low-code or code-free approach to ETL, as opposed to other tools that can require significant coding knowledge.
Visual data flow: ADF provides further usability through its visual interface for designing ETL processes, a major advantage for engineers who prefer a graphical user interface (GUI) over coding.
Security: ADF also provides enterprise-grade security, which includes features like virtual network service endpoints, firewall rules, and managed private endpoints.
Performance and scalability: Azure Data Factory's cloud-native integration allows it to scale as needed to handle vast amounts of data. Additionally, ADF can automatically scale up and down based on workload, ensuring optimal performance at all times.
Pay-as-you-go pricing: Finally, ADF's pricing model adjusts with that scalability, meaning users only pay for the resources they consume. This fact alone makes it a cost-friendly option for teams that deal with sporadic ETL jobs.
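As an illustration of the extensibility point above, here's a minimal sketch of an Azure Function (Python v2 programming model) that an ADF pipeline could invoke through its Azure Function activity; the route name and the validation rule are hypothetical:

```python
# Sketch of a small Azure Function holding custom logic that an ADF pipeline could call.
# The route name and the business rule below are hypothetical examples.
import json
import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@app.route(route="validate-batch", methods=["POST"])
def validate_batch(req: func.HttpRequest) -> func.HttpResponse:
    """Apply custom business logic that doesn't fit ADF's built-in activities."""
    payload = req.get_json()
    rows = payload.get("rows", [])

    # Example rule: flag rows with negative amounts before they reach the target store
    rejected = [r for r in rows if r.get("amount", 0) < 0]

    return func.HttpResponse(
        json.dumps({"total": len(rows), "rejected": len(rejected)}),
        mimetype="application/json",
    )
```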
Building ETL pipelines when ADF isn't an option
Despite all of ADF's advantages, plenty of teams don't have access to Azure subscriptions or work in IT environments that just aren't a fit for Azure Data Factory. That's why it's helpful to understand how to build and maintain ADF-comparable ETL pipelines with other tools (especially those untethered to any particular ecosystem).
As an example, here's how to do exactly that with Shipyard, our own data operations product:
Rapid launching: With a data orchestration platform like Shipyard, teams can quickly build automated workflows on fully hosted cloud infrastructure. And teams can use a mix of their own custom code along with the open source low-code blueprints we provide. This is all analogous to defining datasets and linked services when using Azure Data Factory.
Always-on monitoring: To get the same level of pipeline management and monitoring that ADF provides, users can set up alerts in Shipyard that notify them of any issues in the pipeline. In addition to automatic retries, this provides critical visibility into the workflow status.
Effortless scaling: Shipyard dynamically scales to meet workload variability (whether you're running ten jobs or a thousand). Like ADF, this ensures infrastructure limitations never bottleneck data operations.
Blueprint selection: Before building an ETL pipeline, users can save time by choosing a blueprint from our library or opt to start from scratch, using languages like SQL, Python, and Bash. Shipyard also enables the use of custom blueprints, which allows teams to reuse existing solutions quickly.
Building a workflow: On the Shipyard platform, units of work are referred to as vessels. Shipyard users can walk through a streamlined process to provide their code, requirements, and all the inputs necessary for running their solutions. And, for complex tasks with multiple steps, they can build two or more vessels and combine them into a fleet. This functionality is similar to creating multiple activities in ADF and chaining them together (albeit without any distinctive, on-brand nautical nomenclature). A minimal vessel sketch appears after this list.
Pipeline execution: Once users set up a vessel on the platform, they can schedule it to run at specific intervals or on demand. Shipyard also ensures that every run is retried automatically in case of failures and that multiple vessels can run concurrently without issue.
Log checking: When using Azure Data Factory, users can monitor activity runs. With Shipyard, users track every vessel run in logs, making it easier to diagnose and fix issues without having to dig through each run manually.
Ample integrations: One of ADF's Achilles’ heels is the challenges faced when users must integrate with tools and services that aren't part of Microsoft Azure's ecosystem. Ample integrations are, by contrast, one of Shipyard's strengths.
To date, the platform offers 20+ integrations and over 150 low-code templates, allowing users to connect their entire data stack in minutes. Supported integrations include Amazon Redshift, Amazon S3, Azure Blob Storage, Google BigQuery, Slack, Snowflake, and many more.
Collaboration: Finally, high-performing teams building and managing ETL pipelines require strong collaboration. While ADF users can get this through Azure DevOps, we engineered collaboration into Shipyard's DNA. As a result, entire data teams can create and reuse workflows together, ensuring DataOps as a whole is set for smooth sailing.
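For a sense of what a vessel's code can look like, here's a rough sketch of a plain Python extraction script. It assumes, purely for illustration, that vessel inputs arrive as environment variables and that the API endpoint and field names are placeholders:

```python
# Hypothetical Shipyard vessel: a plain Python script that extracts data from an API
# and stages it for a downstream vessel in the same fleet. The endpoint, the
# environment-variable names, and the output file are illustrative assumptions.
import os
import csv
import requests

# Vessel inputs, assumed here to be supplied as environment variables
api_url = os.environ.get("SOURCE_API_URL", "https://example.com/api/orders")
api_key = os.environ["SOURCE_API_KEY"]

response = requests.get(api_url, headers={"Authorization": f"Bearer {api_key}"}, timeout=30)
response.raise_for_status()
orders = response.json()

# Stage the extracted rows as CSV for the next vessel in the fleet to transform and load
with open("orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "region", "amount", "order_date"])
    writer.writeheader()
    for order in orders:
        writer.writerow({k: order.get(k) for k in writer.fieldnames})

print(f"Extracted {len(orders)} orders to orders.csv")
```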
In search of a better tool to help streamline all your IT needs?
As great as it is, Azure isn't for everyone. So, if you're looking to build an ETL pipeline and are familiar with the benefits of Azure Data Factory, Shipyard offers a comparable and robust platform. Its emphasis on rapid launching, monitoring, and a plethora of integrations makes it a worthy contender for use in any IT environment.
Shipyard is a data orchestration platform designed to help data practitioners quickly launch, monitor, and share highly resilient data workflows. Our goal in creating Shipyard was to provide rapid launching, always-on monitoring, effortless scaling, built-in security, and a veritable boatload of integrations.
We also built Shipyard’s data automation tools and integrations to work with your existing data stack or modernize your legacy systems.
If you want to see for yourself, sign up to demo the Shipyard app with our free Developer plan—no credit card required.