ETL vs. ELT?

In our social media and marketing-driven era, it's quite hard to get things right. For me there is one common misconception brought by the Modern Data Stack idea that everything should be now ELT. In fact no, it shouldn't but only can.

As a curious data engineer, I have worked with both approaches but turns out, I've never asked myself this fundamental question, what does it involve having ETL-only or ELT-only pipelines. Since it sounds like a lightweight topic and I'm writing this a Sunday afternoon, it won't be another moment to approach it!

Definitions and introduction

If you are in data engineering you certainly know these terms but just for the heck of it, let me introduce them before talking about any differences. The ETL stands for Extract Transform Load while ELT for Extract Load Transform. They are both a common approach for processing data. The difference between is firstly technological. While ETL favors a pure data processing layer, such as an Apache Spark cluster, the other prefers a kind-of data warehouse-first citizenship where the data directly lands in the final data store and gets transformed there.

Despite this technological difference, both approaches are somehow in the continuous fight, especially in our social media era. "The end of ETL", "Long live to ELT", ... you've certainly seen some of the headers like these. And they were there even before this social media eta, but at that time, the fight was mostly driven by technology. Either the storage was too expensive and people preferred ELT, or the data was so huge that scaling a data warehouse to handle it was impossible, so everyone implemented ETL. I'm oversimplifying this a bit but found the decisions in that period less influenced by the marketing and superficial opinions I'm challenging in the last section.

Nowadays, ELT seems to have regained popularity but IMO, it's only partially related to the pure technical reasons. Modern Data Stack is an interesting technical concept but it also has a strong marketing support that hides the most important point to keep in mind while designing data platforms. The context. If the transformations you're performing are difficult to express in SQL, even with the support of UDFs or scalable data warehouses, you'll certainly go for ETL pipelines or otherwise, have to write a monster that nobody except you will be able to maintain (can you imagine maintaining a query long for multiple A4 pages?). If on the other hand, your data logic is rather simple and can be fully supported with SQL, you may go with ELT. I put "may" because ELT implies other things that could make you stay with ETL, and vice versa!

So if you have expected to find a firm sentence like "ETL is dead" or "ELT is dead" somewhere in this blog post, sorry for disappointing you. Instead, I'll give you some technical reasons for preferring one or another approach. Or why not both, because they're not exclusive in a data platform!

Pros

You can already see, there is no silver bullet. Both patterns have their strong and weak points and on purpose, I listed 3 pros for each of them. I won't focus on cons because you can conclude them from the list above. Instead, I'll directly try to bust some myths about ELT and ETL.

Myths busted

There is probably one pattern that will fit better into your context, the pattern those drawbacks you'll accept more. Because yes, none of them is perfect. Both have some downsides, both can be badly implemented, and so despite all the - often purely marketing - content that you can find in favor of one or another. And before I let you go, let me embed a LinkedIn post that I saw while writing the "Myths busted" part. It really shows where your focus should go. Thank you, Daniel for this one!


If you liked it, you should read:

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!