SQL is back in a big way this summer, after what seemed like a period of time in the back seat. To find out why and what we missed while it was gone, we asked a few questions of Joe Hellerstein, a computer science professor at UC Berkeley and co-founder of Trifacta. This is the first in a series of chats on SQL’s role in today’s data engineering.
Q: Joe, we hear you talking about 2021 as the “Summer of SQL.” What do you mean by that?
A: Broadly speaking, there are two types of Data Engineering shops: Java-centric and SQL-centric.
Java (and its younger sister Scala) is a general-purpose programming language for software engineers. Many open-source “big data” solutions in the last decade, such as Hadoop, Spark, Kafka, Storm, and Scalding, are designed for Java/Scala programmers. Java aligns well with ETL: Extract-Transform-Load. In this paradigm, you extract data from many sources, transform it using various tools, often Java/Scala-based, and then load the transformed data into a data warehouse for analysis or machine learning work.
On the other hand, SQL is a language specifically for manipulating data. Most cloud data warehouses, such as Snowflake, Amazon Redshift, BigQuery, and Azure SQL Data Warehouse support SQL, as do many traditional database engines including Oracle, MySQL, PostgreSQL, and Teradata. SQL aligns well with ELT, Extract-Load-Transform, where the steps are similar to ETL but ordered differently so that Transformation is done in SQL. With cloud data warehouses enabling scale and flexibility, it’s easier than ever to embrace ELT with SQL.
For the first decades of the Millenium, it seemed like the Java-centric approach was the “hot new thing,” but SQL has been roaring back. Today, SQL seems to be the focus of every data engineering conversation and popping back up on billboards in Silicon Valley.
The comparison of the two “shops” inevitably leads to the question: which is better? There are pros and cons to emphasizing one or the other.
One main issue is the skills and proclivities of your team. Java-centric shops need software engineers who are good at writing and maintaining custom Java code. SQL-centric shops do not write nearly as much custom code and instead emphasize SQL skills. The truth is that outside Silicon Valley, there historically haven’t been many organizations with long-term software engineering groups focused on data. This means SQL has a pretty vital role to play.
Q: Why is SQL Back Now?
A: In two words: The Cloud. The Cloud provided SQL data warehouses with two key properties: ease of getting started and scaling up and down.
Before the cloud, databases had barriers of being hard to manage or manage at scale. Now, businesses can set up a small database in the cloud, and it can grow as big as needed. The heavy lifting is all taken care of by the cloud vendors, including scaling databases as they change over time. Essentially, cloud data warehouses are infinitely scalable and allow flexibility in pricing to meet business needs as they evolve.
Said differently, SQL data warehouses are now far easier to deploy and manage than in the days of Oracle’s dominance. SQL is more accessible for many people. It’s easier for tools to interpret what the code is doing, and it illuminates how data is used and governed. SQL is “just another language” that data teams can choose to work with, not a commitment to administering a complex software stack.
Q: Why is SQL a good fit for Data Engineering?
A: There are four key reasons SQL is a strong fit for data engineering:
-
Declarative DSL for Data:
- Relative to a general-purpose language like Java, SQL is a “domain-specific language” (DSL), focused primarily on one task: querying and manipulating data. The narrow scope might seem to limit potential, but the focus of the language offers several benefits.
- SQL focuses on “what” you want from your data rather than “how” computers should get what you want. Every statement centers on two issues: what data is being accessed and what data is being produced. The technical description is that SQL is a “declarative” language. By contrast, a traditional programming language like Java focuses on “how” the program behaves—the step-by-step description of algorithms. Technically, these are “imperative” languages.
-
Data Lineage:
- As a data DSL, SQL offers very clear data lineage: since each statement has clearly defined data inputs and outputs, humans—and visual tools—can easily map out the “pipelines” of data through a set of SQL statements.
- Rich data lineage is powerful for many purposes. It makes it easy to see how data flows for tracking governance and compliance. It helps clarify how changes to your data sources may affect downstream results. And, it helps debug bad data outputs by easily tracing them to their sources upstream. Because languages like Java are general-purpose and not DSLs for data, there are many ways a programmer can access and emit data within an algorithm. This makes it difficult (for humans or tools!) to extract useful data lineage in general.
-
Scaling:
- SQL scales up and down naturally. Because queries say little about “how” to process data, a scalable cloud data warehouse has the freedom to choose deployments and algorithms that match the scale of your data. For small databases or simple queries, the warehouse can use very few compute resources; for massive databases and complex transformations, the warehouse can easily scale up the job.
- “Scaling up” was a big concern back in the early 2000s with the inventions of MapReduce and Hadoop. At that time, there was a perception in the open-source community that SQL wasn’t ideal because the easy-to-deploy database engines like MySQL and PostgreSQL did not scale up well. The cloud data warehouses have changed that story. It’s common knowledge that solutions like BigQuery and Snowflake scale up and down elegantly. Meanwhile, most “big data” solutions like Hadoop and Spark don’t scale down very well. Running any job often takes at least a minute, so they do not offer the interactive responses that humans need for exploration and debugging.
-
Cloud SQL is increasingly universal:
- SQL now runs on all your data, including your inexpensive cloud storage. Traditional databases like Oracle required you to load their proprietary storage engines to use SQL. Most of today’s cloud warehouses allow you to directly query the data stored in files in your Data Lake while offering specialized database storage for improved query performance.
What does this summer’s SQL comeback mean for the future? The democratization of data has never been more relevant. The next step is understanding data pipelines and breaking down the alphabet soup of ETL and ELT. Check back next week for the next installment.