ETL vs. ELT: What’s the Difference, and Which is Best for You?
ETL (extract, transform, load) and ELT (extract, load, transform) are two common methods of moving data from one place to another in a data pipeline. But how exactly do they work, what's the difference between them, and what are the benefits and drawbacks of each approach to moving your data?
What is ETL?
ETL stands for extract, transform, load. These three words describe what is happening to the data, and the sequence in which it happens, as part of a data pipeline. In the case of an ETL pipeline, the data is extracted from its source or sources, transformed, and then delivered to its final destination.
For example, an ETL pipeline might be used to pull data from various sources, unify it, and insert it into a data warehouse for later analysis. As part of this process, an ETL pipeline does the following (see the sketch after this list):
- It extracts the data from its source or sources. Common data sources for ETL pipelines include SQL or NoSQL databases, SaaS tools, and streaming data APIs.
- It transforms the data. This may involve joining data sets from different sources (for example, merging customer records from several SaaS product and marketing tools), cleaning the data, and converting it (for example, changing its type) so it conforms to the standards required by the destination.
- It loads the data, delivering it to the destination (often a data warehouse). Because the data has already been transformed to meet the ingestion requirements of the destination, it is ready for use as soon as it has been delivered, with no further processing needed.
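To make the sequence concrete, here is a minimal Python sketch of an ETL flow. The source URL, field names, and transformation rules are all hypothetical placeholders, and SQLite stands in for a real data warehouse; a production pipeline would add error handling, scheduling, and incremental loads.

```python
import sqlite3

import requests  # third-party HTTP client; the source API below is hypothetical


def extract():
    """Extract: pull raw records from the (placeholder) source API."""
    response = requests.get("https://api.example.com/customers")
    response.raise_for_status()
    return response.json()


def transform(records):
    """Transform: clean and reshape records to match the destination schema."""
    cleaned = []
    for r in records:
        if not r.get("email"):  # drop incomplete rows
            continue
        cleaned.append((
            int(r["id"]),                 # enforce the destination's types
            r["email"].strip().lower(),   # normalize values
            r.get("country", "unknown"),  # fill gaps with a default
        ))
    return cleaned


def load(rows):
    """Load: deliver the already-transformed rows to the destination."""
    conn = sqlite3.connect("warehouse.db")  # stand-in for a real warehouse
    conn.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, email TEXT, country TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()


load(transform(extract()))  # extract -> transform -> load
```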
What is ELT?
ELT (extract, load, transform) describes the sequence of actions performed on data as it moves through an ELT pipeline. What sets ELT pipelines apart from ETL pipelines is that the raw data is delivered as-is to the destination, where it is subsequently transformed.
Here’s how the sequence works in an ELT pipeline (sketched in code after the list):
- The data is extracted from its source or sources (via automated export, APIs, or a variety of other methods).
- The data is loaded directly into the destination (often a data lake), without any intermediary processing.
- The data is transformed. Since raw data has been loaded into the destination, it will typically require processing before it can be used effectively. This transformation might occur via automated processes executed by the destination server, or on an ad hoc, on-demand basis.
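Here’s the same hypothetical scenario as an ELT flow in Python. The raw payloads land untouched, and the transformation runs later inside the destination; SQLite (assuming a build with its JSON functions, standard in modern versions) stands in for a data lake or warehouse, and all names are placeholders.

```python
import json
import sqlite3

import requests  # third-party HTTP client; the source API below is hypothetical

# Extract: pull raw records from the (placeholder) source API.
raw_records = requests.get("https://api.example.com/events").json()

conn = sqlite3.connect("lake.db")  # stand-in for a data lake or warehouse

# Load: land the raw payloads as-is, with no intermediary processing.
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [(json.dumps(r),) for r in raw_records],
)
conn.commit()

# Transform: later, inside the destination, reshape only what is needed,
# for example as a scheduled job or an on-demand query.
conn.execute("""
    CREATE TABLE IF NOT EXISTS purchases AS
    SELECT json_extract(payload, '$.user_id') AS user_id,
           json_extract(payload, '$.amount')  AS amount
    FROM raw_events
    WHERE json_extract(payload, '$.type') = 'purchase'
""")
conn.commit()
```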
ETL vs. ELT: What’s the difference?
The difference between these two types of data pipelines is how the transformation happens—and thus, what is being delivered to the destination server.
In an ETL pipeline, the transformation happens en route, and what’s delivered to the destination is often clean, structured data that is easy to query and work with immediately.
In an ELT pipeline, the transformation happens after delivery, so what’s delivered to the destination server is often raw data.
Which is better, ETL or ELT?
Both ETL and ELT are valid ways of configuring a data pipeline, and one is not inherently better than the other. However, their key differences lend themselves to different use cases. Each also requires a different workflow and varying skill sets.
What are the advantages and disadvantages of ETL?
The main advantage of an ETL process is that by the time the data gets to the destination, it has already been joined, cleaned, and transformed so it's ready for whatever comes next.
For example, if your primary use case is analytics, your data should be ready to work with the moment it hits the data warehouse. Analysts shouldn’t have to spend time joining and cleaning data; they should be able to simply grab the necessary information (often via a simple SELECT query, in the case of a relational database) and jump right into the work of analysis.
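To illustrate, here is roughly what that might look like in Python: a single SELECT against a table the ETL pipeline has already cleaned and loaded. The table and column names are hypothetical, and SQLite stands in for a real warehouse client library.

```python
import sqlite3  # stand-in for a real warehouse client library

conn = sqlite3.connect("warehouse.db")

# Because the ETL pipeline already joined and cleaned the data, the
# analyst's query is a plain SELECT over a tidy table (names hypothetical).
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS signups
    FROM customers
    WHERE signup_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY country
    ORDER BY signups DESC
    """
).fetchall()
```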
Another advantage is ease of use. Because ETL pipelines have been common for quite a while now, a variety of user-friendly tools such as Stitch, Fivetran, and Shipyard exist to facilitate their creation and maintenance.
The primary disadvantage of ETL pipelines is that because the transformation happens en route, there is typically a meaningful delay between extraction of the data and delivery of the final data sets to their ultimate destination.
The transform stage of an ETL pipeline also tends to be complex, as it has to take data from a variety of sources, apply various transformations to it, and make it conform to the requirements of the destination. Destinations such as relational data warehouses may have rigid schemas. This complexity can make ETL pipelines challenging to build, and they also require monitoring and occasional maintenance. The addition of a new data source, for example, will probably necessitate making changes to the pipeline.
Finally, because ETL pipelines require an intermediary staging area where data is transformed before being passed to the target system, they can be less efficient.
Compute resources, for example, are required to process all the data that passes through an ETL process, even though some of that data may never be used after it reaches your data warehouse. Depending on your use cases and the data you’re extracting, you could end up paying to process a lot of data that you never look at again because it has no real business intelligence value.
When should you use an ETL pipeline?
ETL pipelines are best suited for use cases where very low latency is not required.
For example, the loading of data from a variety of data sources into a data warehouse for regular analysis is a common use case for ETL pipelines, because most data analysis doesn’t require real-time data. If your analysts are preparing a weekly or monthly report, they’re likely pulling data that fits between predetermined start and end points—not up-to-the-minute data logged just seconds before they ran their query.
ETL pipelines are also well-suited for use cases where ease of use is important. Though the pipelines themselves can be complex to build without the right data orchestration tools, once they are built, they make working with the delivered data fairly easy. This is because the data has already been cleaned and transformed to the specifications of the destination data warehouse.
ETL pipelines shine where sensitive data is involved, or where rules govern what data can be stored in the cloud versus on-premises. This is because they allow you to create strict schemas that define precisely what data is stored where, and in what format.
Data that you don’t want to store, or aren’t allowed to store, can be eliminated via the transformation process prior to delivery. Regulations such as GDPR and HIPAA, for example, can restrict where certain data may be stored or transferred. If you’re using a cloud data warehouse, an ELT pipeline could violate these rules by transmitting raw user data to a warehouse that might not be in the right location. With an ETL pipeline, skilled data engineers or ETL tools use the transformation process to identify and remove the restricted data, ensuring that it never gets beyond the source system.
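Below is a minimal sketch of what that kind of transform-time scrubbing might look like, assuming the regulated fields are known by name. The field names and the hashing choice are illustrative, not a compliance recipe.

```python
import hashlib

# Fields that must not leave the source environment (hypothetical list).
RESTRICTED_FIELDS = {"ssn", "date_of_birth", "home_address"}


def scrub(record: dict) -> dict:
    """Drop restricted fields and pseudonymize the direct identifier."""
    cleaned = {k: v for k, v in record.items() if k not in RESTRICTED_FIELDS}
    # Replace the user ID with a one-way hash so downstream analytics
    # can still count distinct users without seeing the raw identifier.
    cleaned["user_id"] = hashlib.sha256(str(record["user_id"]).encode()).hexdigest()
    return cleaned


record = {"user_id": 42, "ssn": "000-00-0000", "country": "DE", "plan": "pro"}
print(scrub(record))  # {'user_id': '<sha256 hex>', 'country': 'DE', 'plan': 'pro'}
```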
What are the advantages and disadvantages of ELT?
The primary advantage of an ELT process is speed of delivery. Because the data doesn’t have to be transformed en route, it can be extracted from the sources and loaded into the desired destination in near real time.
Depending on the use case, ELT pipelines can also be more resource-efficient, because they don’t require transforming all of the data prior to delivery.
In an analytics use case, for example, an ETL pipeline would transform all the data it extracts, even if that data is never ultimately used by analysts. In contrast, an ELT pipeline doesn’t transform any data before it reaches the destination. With an on-demand transformation setup, only the data your analysts actually query is processed. This can save money and make the transformation faster.
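One common way to set that up is to define the transformation as a view over the raw data, so the work happens only at query time. Here is a sketch that reuses the hypothetical raw_events table from the earlier ELT example, again with SQLite standing in for the destination.

```python
import sqlite3

conn = sqlite3.connect("lake.db")

# A view defines the transformation without running it: the JSON parsing
# below executes only when someone actually queries the view.
conn.execute("""
    CREATE VIEW IF NOT EXISTS daily_revenue AS
    SELECT json_extract(payload, '$.day') AS day,
           SUM(json_extract(payload, '$.amount')) AS revenue
    FROM raw_events
    WHERE json_extract(payload, '$.type') = 'purchase'
    GROUP BY day
""")

# Only the rows this query touches are processed; raw data that no one
# queries never incurs transformation cost.
for day, revenue in conn.execute("SELECT * FROM daily_revenue WHERE day >= '2024-01-01'"):
    print(day, revenue)
```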
The main downside of ELT pipelines is that because they deliver raw data to the destination, they require more expertise to work with after the data has been delivered. You’ll need either a team that is comfortable working with unstructured/raw data—to make the required transformations themselves—or engineers who can build sophisticated systems to automate the transformations you’ll need after data is loaded.
When should you use an ELT pipeline?
ELT pipelines are ideal for use cases that require speed. For example, if your product features a recommendation engine driven by machine learning (ML), an ELT pipeline delivers customer data to power those recommendations in near real time—which could be the difference between making or missing a conversion.
ELT pipelines are also better if you want to remain flexible. With ETL pipelines, performing data integration and cleansing prior to delivery means that unnecessary data points may be dropped, rows with incomplete data may be removed from tables, or data types may be converted in ways that lose information (as when converting DECIMAL or FLOAT data to INT in SQL). ELT pipelines make it easy to quickly move all your data into a data lake and then apply whatever transformations are required after the fact, without losing any of the original raw data.
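To see the kind of loss that paragraph describes, here is a quick Python illustration of truncating fractional values to integers; a SQL CAST from DECIMAL or FLOAT to INT discards precision in much the same way (exact rounding behavior varies by database).

```python
prices = [19.99, 4.50, 120.75]

# Converting FLOAT/DECIMAL values to INT drops the fractional part,
# and the lost precision cannot be recovered from the converted data.
as_ints = [int(p) for p in prices]

print(as_ints)                              # [19, 4, 120]
print(round(sum(prices), 2), sum(as_ints))  # 145.24 143
```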
ELT pipelines are often the best choice for big data applications. If you want to move large amounts of data to your target database, data warehouse, or data lake for storage, transforming all of that data en route would require a lot of processing power. Depending on what you plan to do with the extracted data after the load process is completed, that expense might not be necessary. An ELT process, which lets you pay to transform only the data you need, may be the better choice.
Ultimately, both ETL and ELT pipelines have their uses, and either one or both might be right for you, depending on your use case.
Both can be complex to build and set up, but data orchestration tools like Shipyard can make the building and maintenance process a whole lot easier—sign up today to try our free forever plan.