Configure Data Quality monitoring for Cloud Dataprep pipelines

Strategy   |   Bertrand Cariou   |   Dec 21, 2020   |   Time to read: 12 mins

Build a simple, flexible, yet comprehensive Data Quality monitoring solution for your Google Cloud Dataprep by Trifacta pipelines with Cloud Functions, BigQuery and Data Studio

Building a Data Quality Dashboard

Building a modern data stack to manage analytic pipelines—such as Google Cloud and a BigQuery data warehouse or data lake—has many benefits. One such benefit is the ability to automatically monitor the quality of your data pipelines. You can ensure that accurate data is fueling your analytics, track data quality trends, and, if a data quality issue does arise, react quickly to resolve it. 

Let’s review how to create a beautiful Data Studio dashboard to monitor Cloud Dataprep data quality pipelines.

For this article, let’s assume that you’re responsible for managing a data pipeline for a Sales Cloud Data Warehouse (DWH) and you want to monitor Data Quality (DQ) for this particular Sales DWH. You’ll have to create two distinct but correlated pipelines:

  1. Pipeline 1 for loading the actual Sales DWH
  2. Pipeline 2 for monitoring DQ 

When the Sales DWH is loaded, it will trigger the DQ Pipeline.

At a high level, the solution follows these principles, which you will implement throughout this article:

  • Switch on job “Profiling Results” and “Data Quality Rules” for the Sales DWH datasets (Sales and Advertising Clickstream).
  • Execute the Sales Dataprep jobs to load the Sales DWH.
  • Execute a webhook when the Sales DWH is loaded to trigger the DQ Pipeline.
  • Report on the Google BigQuery tables with Google Data Studio to track the Data Quality trends for the Sales DWH.

Here is an example of the Google Data Studio report you will be creating. This blog explains in more detail what it provides.

Building the Data Quality Pipeline with Cloud Dataprep

Here is an overview of the entire solution used to capture Cloud Dataprep Data Quality statistics and load them to BigQuery. The automation piece comes from Dataprep jobs, while Data Studio handles reporting.

  • The upper lane represents the actual Sales DWH, which is loading sales and advertising data into a DWH. 
  • As soon as the Cloud Dataprep job has finished executing, it executes a webhook (a task) that triggers a second Cloud Dataprep job.
  • This second job, in the lower lane, loads the Data Profiling and Data Quality Rules results (in the form of JSON files) into BigQuery tables in order to report on them with Data Studio.

1. Pre-requisites

Be familiar with Profiling & Data Quality Rules: For the datasets whose Data Quality you want to monitor, you must enable “Profiling Results” and define “Data Quality Rules”. Please familiarize yourself with these features; they produce the essential statistics we will leverage to build the DQ dashboard. Each time you run a job with Trifacta, Data Quality statistics are created and can be accessed via APIs or through JSON files. We will leverage the JSON files to build the Data Quality database. You will have to enable Profiling and Data Quality Rules in the jobs and flows respectively. If you don’t have a flow to monitor yet, no worries: we will leverage the Sales DWH flow provided later in this article.

Downloads: All the assets for this blog are available in this GitHub repository.

  • Download ‘flow_Profiling Quality Rules Processing.zip’, which is the DQ flow that extracts the Profiling and Data Quality results and loads them into BigQuery tables.
  • Download ‘flow_Data Quality_Clickstream_and_Sales.zip’, which is the Sales DWH flow example used in this blog. If you wish, you can instead use one of your own Cloud Dataprep flows whose Data Quality you want to monitor.
  • Download ‘Advertising_Clickstream.csv’ and ‘Sales_Data_small.csv’, which are the example source files for the Sales DWH.

You can make your own copy of the Data Studio dashboard ‘[PUBLIC] Cloud Dataprep Profiling & Data Quality Dashboard’ to customize it to your Data Quality monitoring needs.

Service Access:

You will need a valid Google account and access to Cloud Dataprep and Google BigQuery in order to try it out. You can go to the Google Console https://console.cloud.google.com/ to activate those services.

Calling APIs: To call APIs you’ll need an Access Token, which you can generate from your Preferences page in Cloud Dataprep.

2. Create Cloud Dataprep Data Quality Flow

Import the DQ Flow in Cloud Dataprep

If you have not done so yet, download the ‘flow_Profiling Quality Rules Processing.zip’ DQ flow and import it (without unzipping it) into your Dataprep environment. In the Cloud Dataprep application, click the Flows icon in the left nav bar. Then, on the Flows page, select Import from the context menu.

The two recipes in this flow parse the JSON files and convert them into a columnar BigQuery table format for simplified reporting in Data Studio. If you’re curious, you can take a look at the recipes “Profiler Rules” and “Profiler Check” to understand the logic.

Prepare for the API call

You now need to get the ID of this flow. It will be used later on to trigger, with an API call, the 2 jobs from the DQ flow you just imported.

You can get the id from the Flow URL:

In my example <flow_id> = 355168
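
As an illustration of what to look for (the URL below is an assumption about the exact format, shown with our example flow id), the browser URL of the flow typically ends with this numeric id:

https://clouddataprep.com/flows/355168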

Now, you need to customize this Flow to your own personal Dataprep configuration and settings. In the next section, you acquire the Profiling and Data Quality Rules output that were generated as part of one of your job runs (in our example the Sales DWH). Then, you feed them into your DQ pipeline to load them into BigQuery.

Get the Profiling and Data Quality rules results file names

When Profiling is enabled and Data Quality Rules are defined, Dataprep generates 3 JSON files at the end of the job execution, located in your Google Cloud Storage staging bucket under this default folder:

/<user_name>/jobrun/<output_name>_<job_id>/.profiler

Where:

  • <user_name> = GCP user (account email) running the Dataprep job
  • <output_name> = Dataprep Output object name
  • <job_id> = Dataprep job id

And the 3 (multi-part) JSON files generated are:

  • ‘profilerRules.json’ = Data Quality Rules results
  • ‘profilerTypeCheckHistograms.json’ = Profiling information with Valid, Missing and Mismatched values for all columns
  • ‘profilerValidValuesHistograms.json’ = Profiling information with the top 20 values and statistics for all columns

For our current example, our <staging_bucket_name> is “dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce”, and we have used the default <jobrun_folder>. For this demonstration we will only use the first two JSON profiling files. You may want to extend the solution later on with the third file, which contains some extra statistics.
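
To make the folder structure concrete, here is what the path to the rules file could resolve to. The bucket is the one from our example, while the user, output name, and job id are purely illustrative placeholders:

gs://dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce/john.doe@example.com/jobrun/Advertising_Clickstream_123456/.profiler/profilerRules.json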

Note that the default <jobrun_folder> used in this example may have been changed in your own environment. From the User Preferences menu in the left nav bar, you can verify the <jobrun_folder> used for your account.

Update the DQ Flow with new Parameterized Inputs Datasets

We can use two of the files from a job execution (‘profilerRules.json’ and ‘profilerTypeCheckHistograms.json’) as the basis for creating the imported datasets for the DQ flow. These files are accessed via their storage paths.

With Cloud Dataprep, when you create a new imported dataset, you can parameterize parts of the path, which allows you to create imported datasets that match all such files from all of your job runs. These are called datasets with parameters.

In the “Profiling & Quality Rules Processing” Flow previously imported, you need to update the two “Profiler input datasets” with your <staging_bucket_name> and <jobrun_folder>.

For this, we advise you to run the job once with Profiling Results and Data Quality Rules enabled, so that you can then update the flow with an existing path and existing profiler JSON files.

Edit parameters for the 2 datasets with the proper path.

First, from the Input Dataset Details panel, select the Edit parameters… item from the “…” menu on the right.

Then, with the Browse button, find your staging bucket and build the path with the 3 variables <user_name>, <output_name> and <job_id> so that it matches your profiler JSON files, as shown below. We advise you to first locate an existing profilerRules.json file and then replace the user name, output name, and job id segments of its path with the 3 variables.
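
Putting it together, the parameterized path for the rules input dataset follows a pattern like the one below (the bucket name is the one from our example; the bracketed segments stand for the 3 variables you define). The second input dataset uses the same pattern, ending with profilerTypeCheckHistograms.json instead:

gs://dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce/<user_name>/jobrun/<output_name>_<job_id>/.profiler/profilerRules.json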

After your 2 input datasets have been updated, check that you can see some data in the Data Preview from the Details pane for each of the recipes.

If you cannot see any data, verify that your path is correct and that you can see data in the dataset.

You can also edit the 2 recipes to check the input datasets and investigate what could have gone wrong.

We won’t explain these recipes in detail here, but you can edit and modify them to tune them to your needs.

Update the DQ Flow with BigQuery outputs tables

Finally, you need to update the outputs of the 2 recipes to ingest (in append mode) the Profiling and Data Quality Rules results into 2 BigQuery tables. These 2 BigQuery tables will be created automatically by Dataprep the first time you run the job; subsequent runs will simply append new data to these 2 tables. This way you will keep track of the DQ history.

You can now run the job to produce these 2 outputs, and check that your 2 BigQuery tables are populated with Profiling and Data Quality Rules results.

Checkpoint: You have successfully imported the Data Quality flow, customized it for your needs with the proper file paths, and run the 2 DQ jobs to populate the BigQuery tables.
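
If you prefer to verify this from code rather than in the BigQuery console, here is a minimal Python sketch using the google-cloud-bigquery client library; the project, dataset, and table names are assumptions, so replace them with the ones you defined for your two outputs:

    from google.cloud import bigquery

    # Hypothetical names: replace with your own project, dataset and output table names.
    PROJECT = "my-gcp-project"
    DATASET = "dataprep_dq"
    TABLES = ["profiler_rules_results", "profiler_check_results"]

    client = bigquery.Client(project=PROJECT)

    for table in TABLES:
        # Each Dataprep run appends rows, so the row count should grow over time,
        # which is exactly the Data Quality history we want to keep.
        query = f"SELECT COUNT(*) AS row_count FROM `{PROJECT}.{DATASET}.{table}`"
        rows = client.query(query).result()
        for row in rows:
            print(f"{table}: {row.row_count} rows")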

3. Create a Cloud Dataprep Flow and Configure a Webhook

Ok, we’re almost done—but not quite. We now need to create the Sales DWH flow, available here as an example, which contains the datasets we’ll use to monitor our data quality. You may prefer to use one of your existing flows in your own Dataprep project. Within the flow you have selected for data quality monitoring, we will call the above-mentioned Dataprep DQ flow to publish the Profiling and Data Quality Rules results into your Google BigQuery tables.

Calling the DQ flow is done through a Webhook notification (an external task call; if you are not familiar with webhooks, you can read the documentation here), which lets you define outgoing HTTP messages to any REST API.

Let’s see what it looks like.  Here is how to configure a Webhook task in your flow that will call your “Profiling & Quality Rules Processing” flow and run the two jobs.

The Webhook task needs to be configured with this information:

  • <url>: It’s the Cloud Dataprep API endpoint to run the “Profiling & Quality Rules Processing” flow.

The default Cloud Dataprep API endpoint to run a flow is https://api.clouddataprep.com/v4/flows/<flow_id>/run

Where <flow_id> is the ID of the “Profiling & Quality Rules Processing” flow you have retrieved from the previous step.

  • Headers, as shown in the screenshot below, with <content-type> set to the value <application/json>, and <authorization> set to the value <Bearer access_token>.

The Dataprep <access_token> is used to call the Dataprep API; see the “Calling APIs” note in the Pre-requisites section above.

  • Body with the value shown in the screenshot below. Note that for <user> you need to put the GCP user account (email) running the Dataprep job, and for <output_dataset> the name of the output object defined in your current flow. The equivalent HTTP request is sketched just after this list.
  • Trigger event: you can decide to trigger the Webhook only when the job is successful.
  • Trigger object: trigger the Webhook only for a specific output in the flow (the one you referenced in the Body parameter).
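
For reference, here is a minimal Python sketch (using the requests library) of the HTTP call that the Webhook task performs. The endpoint and headers come from the steps above, while the flow id, token variable, and empty body are assumptions to adapt to your own configuration (reuse the exact JSON body you entered in the Webhook task):

    import os
    import requests

    # Assumed values: use your own flow id and access token.
    FLOW_ID = 355168  # id of the "Profiling & Quality Rules Processing" flow
    ACCESS_TOKEN = os.environ["DATAPREP_ACCESS_TOKEN"]  # token generated from the Preferences page

    url = f"https://api.clouddataprep.com/v4/flows/{FLOW_ID}/run"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {ACCESS_TOKEN}",
    }

    # Placeholder body: paste here the JSON body you configured in the Webhook task,
    # which carries the <user> and <output_dataset> values for the DQ flow.
    body = {}

    response = requests.post(url, headers=headers, json=body)
    response.raise_for_status()
    print(response.status_code, response.text)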

When you have entered this information you can test your Webhook task.

After you save the Webhook task, it’s ready to be called when the job is executed.

Don’t forget to check Profile Results in the publishing settings of your output. Without it, the profiling results will not be generated, which would defeat the whole purpose of the solution.

Checkpoint: You have successfully configured a Webhook on the Sales DWH flow. When the job has finished, the Webhook triggers the DQ flow, which populates the BigQuery DQ tables.

You can configure a Webhook for each of the outputs of your DWH flows if you want to collect and monitor the Profiling and Data Quality Rules results for all of them.

4. Testing the end-to-end process

You are now ready to test the end-to-end process by running a job from your Dataprep flow and checking the data quality statistics in your Data Studio report.

From the Sales DWH flow, run a job (by clicking the Run Job button) on the “Advertising_Clickstream” output, for example.

When the “Advertising_Clickstream” job is complete, you can check on the Job Results page, from the Webhooks tab of the job, that the Webhook has been triggered successfully:

You can also check on the Job Results page that the 2 jobs “Profiler Rules” and “Profiler Check” have started:

And when these 2 “Profiler Rules” and “Profiler Check” jobs are complete, you can check in your 2 BigQuery tables that the Profiling and Data Quality Rules results have been inserted correctly:

Finally, you can compare the Data Quality Rules shown in the Dataprep UI with the Data Quality Rules in your Data Studio dashboard. The values should be identical.

Conclusion

By completing this step-by-step guide, you can now publish Dataprep Profiling and Data Quality Rules results to Google BigQuery tables, which in turn allows you to create a Data Studio dashboard that monitors Data Quality trends and shares the information with your broader team.

Along the way, you have learned some advanced Cloud Dataprep techniques: enabling Profiling Results and Data Quality Rules, creating datasets with parameters, triggering flows through Webhooks and the Dataprep API, and publishing results to BigQuery tables.

You can extend the solution to monitor additional Google Cloud services for an end-to-end data quality pipeline monitoring.

You are now ready to automate the monitoring of your entire Data Quality pipelines with Cloud Dataprep.
