Transformation: Next-Level SQL is part 3 in the Summer of SQL Series with Joe Hellerstein, professor at UC Berkeley and Co-Founder at Trifacta. For part 1, read Summer of SQL: Why It’s Back and part 2: SQL Pipelines and ELT.
What’s different about SQL for ELT… i.e., SQL for data transformation?
When we think about SQL, traditionally, we assume we have a nice clean database, and we’re asking basic questions for business purposes. You know, things like “How many of the cell phones that we manufactured last year did we sell to people under 30 in Nevada?” People talk about SQL as a “query language,” which is what the acronym stands for—the Structured Query Language.
But when we use SQL for Transformation—the “T” in ELT—the focus changes. In this case, we’re taking many messy and disparate tables and manipulating them into a more usable or common form. To take our example from before, we may be extracting and loading sales data from 17 electronics chains that sold the phones, and our job in SQL is to write transformation queries that integrate that data. For example, we might have to convert each source table to extract common notions like “what products qualify as a cell phone?”, “what constitutes a completed sale?”, “how do I find the home state of a customer?” “how do I predict age information for a customer if it’s missing?” Etc. etc.
Of course, SQL is the same language whether you use it for transformation or querying. But the parts of the language that get emphasized are different. Bottom line: SQL for ELT is often more complex than SQL in analytics settings.
Talk a bit more about that: what kinds of SQL complexity do you see in ELT?
There are many different kinds, which is itself part of the complexity. Remember that a table of data that’s not yet transformed could be pretty raw in several ways. Let me mention a couple of the “biggies.”
“Regexes”
One issue I see in ELT settings is what people call “text blobs”: columns that store text strings initially intended for human consumption. Typical examples might include product descriptions, text from a customer-support chatbox, or transcribed doctor notes from medical charts. Often you want to pull little patterned snippets out of that text—say the occurrence of names for symptoms or medicines. SQL does allow for that: for example, most SQL engines support regular expressions for parsing text. But “regexes” are notoriously tricky, and they’re just as tricky in SQL as they are in any other language.
Traditional SQL queries over clean data warehouses don’t use regexes nearly as much as ELT use cases. And sometimes it goes next-level. When regexes aren’t sufficient, many SQL systems will also let you call out to external services or code for Natural Language Processing via AI. That raises operational issues: Who deploys and maintains those services or code? How reliable and explainable is that AI? How do downstream uses of the data know that they’re looking at AI “guesses” rather than recorded data? Etc.
Data Integration
Another issue in ELT is data integration: the need to write SQL to take data from multiple sources and make it consistent. Take the case of sales records from multiple retailers. At the most basic level, they may have different text strings for the same product, like “iPhone 12 mini” vs. “Apple iPhone 12 mini 5G” and so on. They may represent notions like a “sale” differently at a more conceptual level: one vendor may record “total sale,” including local sales tax, where another one does not. And at another level, the table structures from the different sources can be different: one might have a separate row per item sold, and another might have a row per checkout basket with the specific items nested inside.
You can’t solve these problems until you know you have them! So each of these inconsistencies first needs to be discovered in the data via SQL queries. Then each case needs to be transformed using SQL into some standard output table formats. That data integration pipeline has to be built out so that traditional analytics queries on the totality of the data make sense.
This leads to a process-oriented difference in ELT, namely the need to build reliable data pipelines in SQL. In ELT, new data is extracted and loaded regularly. The transformation code in the pipeline needs to run correctly with new data loads, while also testing for and flagging any deviation from the previous conventions—all that in SQL. And then, of course, the SQL code has to be maintained as the input data evolves and changes. This differs from common assumptions about analytics, where people often use SQL for one-shot queries on static data.
That does sound more intense than traditional SQL uses. How do tools make this easier?
Even if you’re a serious data transformation person and well-versed in SQL, you can benefit enormously from automation assistance in doing your work. There are three buckets of approaches that people have come up with over the years to help.
- Language improvements. “Macro”-languages for transformation that translate down to SQL.
- Graphical code-generation environments. Graphical environments for helping generate SQL.
- SQL-centric Pipeline/Workflow tools. Tools for composing small SQL queries into more extensive pipelines.
- And recently there’s a fourth category: AI coding assistants. New assistive technologies can ease the burden of coding, helping auto-suggest steps to programmers and save them the trouble of writing every code detail by hand. These assistants often appear as part of a low-code or no-code interface for data transformation.
There is no reason these approaches can’t co-exist. In fact, enabling people to do so, and share the space while choosing their own tools, enables democratization. That term is not only about making data engineering more accessible to less technical folks, democratization is also about bringing people together. This allows technical folks who are fluent in languages like SQL to co-exist and collaborate with newcomers and less technical users… all in the same software stack.
One more thing to keep in mind: Yes, SQL for transformation can get pretty gnarly. That said, SQL is often a much easier data language to manage and govern over time than programming languages like Java or Python.
In SQL, everything is “data in, data out.” It’s a data-centric language. There’s a benefit there whether you’re writing SQL by hand or taking advantage of tools to help generate SQL.