Google Data Fusion und andere Transformationsdienste für BigQuery Dataprep

Technology   |   Bertrand Cariou   |   Oct 25, 2019 TIME TO READ: 4 MINS
TIME TO READ: 4 MINS

The good news is, when it comes to moving and transforming data for analytics built on top of BigQuery, the Google Cloud Platform (GCP) offers a plethora of GCP native choices. The bad news is that you have to make a choice among many options to ensure the technology will fit your needs and prove to be sustainable over time. 

This article is a check-list to help you choose the best technology (or combination of technologies) that is right for you. If your main database for reporting is BigQuery, you will need to clean, transform, and combine data for your analytics in batches and/or micro-batches. For this reason, we have ruled out Google BigQuery Data Transfer Service, which can only move data into BigQuery but can’t transform it. We limited the possibilities to only native GCP data transformation services, including Cloud Dataproc, Cloud Dataflow, Google Data Fusion (beta), and Cloud Dataprep. Sure, transformations can also happen in BigQuery using SQL, which is quite a powerful language; however, SQL does not make a sustainable solution for enterprise data transformation services for many reasons (this would need a complete article on its own).

So what are the criteria for selecting the right technology? 

First, are you looking to write code in a programming language for your solution?

If software engineers or developers have to develop and integrate various components to create an application that includes data transformation at scale, then you should look at Cloud Dataflow or Cloud Dataproc. These are code-based solutions that require programming skills such as Java, Python, RestAPI, or Apache Spark. These require fairly technical knowledge. With these services, you already know which transformations need to be created, and you need a development tool to code these transformations.To figure out whether to choose Cloud Dataflow or Cloud Dataproc, read this article, which provides a decision tree to help you make your choice.

Second, is your data already in GCP?

If your data is not yet in GCP, Google Fusion, the G
CP Extract, Transform, and Load (ETL) solution, provides numerous connectors to bring the data into GCP and BigQuery. Alternatively, you could leverage other ETL solutions you already have in-house that can bring the data into GCP. Google recommends these ETL vendors, including Matillion, Fivetran, Informatica, and many more.ETL solutions such as Google Data Fusion are best to develop large-scale data pipelines to move data from on-prem and cloud sources for “hydrating” a data lake or a Data Warehouse (DWH). Data engineers, ETL developers, and data architects will benefit most from Google Cloud Data Fusion. They will be able to develop solid real-time and batch-data transformation pipelines. Google Fusion and ETL are excellent in building stable data pipelines to move data routinely to a DWH or a data lake. They are not meant to create pipelines that evolve often.

Third, in addition to data engineers, does your project need self-service data transformation by business users and the flexibility to adapt to business needs?

Data preparation is the most recent evolution of data transformation technologies, bringing these capabilities in the hands of data-driven professionals who are less technical and often reside in a line of business (in addition to the usual technical data specialists such as data engineers). These professionals know the data better than anyone else however, they may lack the technical skills to clean, combine, and enrich the data to make it usable for their needs. Google Cloud Dataprep offers a grid-based, machine learning-guided visual interface to enable agile data exploration and assessment to refine, standardize, and combine data for BigQuery for analytics. Users interact with the content of data to iteratively refine and bring together data for their downstream business-driven analytics.

Google Data Fusion also generates Cloud Dataproc code to transform the data, while Cloud Dataprep generates some Dataflow code to transform the data. Both have their advantages based on your use case and your former experience. This could also be an element to weigh in favor of a particular technology to deliver analytics on top of BigQuery. 

Ultimately, you may need to combine multiple solutions to create end-to-end analytics or machine learning solutions for BigQuery. The following table summarizes these technologies’ characteristics to guide your choice.

Want to learn more about Dataprep, Data Fusion, Dataflow and Dataproc? Check out this article.

Tags