Maintain BigQuery data lineage by enriching Google Cloud Data Catalog tags with Dataprep metadata and profiling results
Cataloging Dataprep Pipelines
Google Cloud Data Catalog is the de facto metadata cataloging solution for your analytics initiatives on Google Cloud. Data Catalog natively and automatically captures BigQuery datasets, tables, and views, giving you visibility into how your data warehouse or data lake schema is organized.
Despite its strengths, however, Data Catalog does not capture how and when a specific table was loaded, which is often critical information for understanding the lineage of your data and evolving your analytics pipeline. The solution? Put the data engineering cloud platform Google Cloud Dataprep by Trifacta to work twice: in addition to preparing data for your data warehouse or data lake, use it to manage your metadata. Both Data Catalog and Dataprep offer APIs to capture metadata and, with a little bit of Cloud Functions scripting, you can keep the metadata in sync and build end-to-end data engineering visibility.
This article describes how to synchronize Dataprep jobs and profiling results with the Data Catalog service. In this guide, we assume that you are already familiar with Google Cloud Dataprep and know how to design a flow and run a job. If you need a Dataprep refresher, please go through this tutorial.
Getting Started
For the purposes of this how-to guide, let’s assume that you’re responsible for maintaining the data pipeline of a Google Cloud data warehouse (DWH) built on BigQuery, and you want to track Dataprep jobs and profiling results for this particular DWH. You’ll follow these two principles:
- Execute the Dataprep data pipeline that loads data into the DWH
- Call a Cloud Function at the end of the pipeline to load Dataprep job and profiling metadata into Data Catalog
After you set up this process, it will run automatically: each time the DWH refreshes, Data Catalog will also be refreshed. You will always have the latest information within Data Catalog regarding your DWH data pipeline activity.
We will be leveraging Data Catalog tags to record Dataprep job results information and profiling results.
To activate this solution, you will need to:
- Extend Data Catalog with tag templates
- Create the Cloud Function that synchronizes Dataprep metadata with Data Catalog.
- Set up your Dataprep pipeline to call the Cloud Function (webhook) at the end of the pipeline.
- Activate “Profiling Results” for your Dataprep jobs.
- Execute the Dataprep jobs to load the DWH.
- Check that the Data Catalog tags are updated with the Dataprep metadata corresponding to the DWH BigQuery tables and columns.
To activate, learn, and use Data Catalog, go to https://cloud.google.com/data-catalog and https://console.cloud.google.com/datacatalog.
Create Data Catalog Tags Templates
Before creating and updating Data Catalog Tags on BigQuery tables (the DWH), you will need to create Data Catalog Tag Templates based on the Dataprep jobs and profiling information you want to track.
Follow these GitHub descriptions to create your two templates:
You can select one of these three options to create the tag templates in Data Catalog (the first one is the easiest if you are not familiar with the command line or the API):
- Visit the Google Cloud Console where you can visually manage your Tag Templates.
- Use gcloud with the command “gcloud data-catalog tag-templates create”. You can find the correct command line in gcloud_tag-templates_create.sh, and more details in the Google Cloud documentation with an example and the reference documentation. Be aware that with the gcloud command line you cannot control the order of the tag template fields; they will appear in alphabetical order.
- Use Google REST API calls with the two tag template JSON files, dataprep_metadata_tag_template.json and dataprep_column_profile_tag_template.json. Explanations and details on using the REST API are in the GCP documentation, with an example and the reference documentation. If you prefer scripting, a Python sketch follows this list.
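As an illustration of that last option, here is a minimal sketch using the google-cloud-datacatalog Python client library. The project, location, template ID, and field names below are placeholders for this sketch; the real template definitions are in the JSON files mentioned above.

```python
# Minimal sketch: create a simplified "Dataprep job metadata" tag template.
# Project, location, template ID, and field names are illustrative only.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

template = datacatalog_v1.TagTemplate()
template.display_name = "Dataprep Job Metadata"

# A string field for the Dataprep job name.
template.fields["job_name"] = datacatalog_v1.TagTemplateField()
template.fields["job_name"].display_name = "Dataprep job name"
template.fields["job_name"].type_.primitive_type = (
    datacatalog_v1.FieldType.PrimitiveType.STRING
)

# A timestamp field for the job run date.
template.fields["job_run_date"] = datacatalog_v1.TagTemplateField()
template.fields["job_run_date"].display_name = "Dataprep job run date"
template.fields["job_run_date"].type_.primitive_type = (
    datacatalog_v1.FieldType.PrimitiveType.TIMESTAMP
)

created = client.create_tag_template(
    parent="projects/your-project-id/locations/us-central1",
    tag_template_id="dataprep_metadata",
    tag_template=template,
)
print(f"Created tag template: {created.name}")
```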
Dataprep Jobs Metadata Template Tags
The Dataprep Jobs Metadata tag template contains information from the Dataprep job used to create or update the BigQuery table:
- Dataprep Job (id, name, url, timestamp)
- Dataprep user
- Dataprep Dataset (id, name, url)
- Dataprep Flow (id, name, url)
- Job Profile (url and number of valid, invalid and empty values)
- Dataflow job (id, url)
Dataprep Column Profile Metadata Template Tags
The Dataprep Column Profile tag template contains each BigQuery table column’s profile, with the number of valid, invalid, and empty values.
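For illustration, this is roughly what attaching such a column-level tag looks like with the Python client library: Data Catalog scopes a tag to a single column through the tag's column attribute. The template ID, field names, counts, and table are assumptions for this sketch, not the exact names from dataprep_column_profile_tag_template.json.

```python
# Illustrative sketch of a column-level Data Catalog tag.
# Template ID, field names, counts, and table are placeholders.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Find the Data Catalog entry of the BigQuery table that Dataprep loaded.
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/your-project-id"
            "/datasets/sales_dwh/tables/orders"
        )
    }
)

tag = datacatalog_v1.Tag()
tag.template = (
    "projects/your-project-id/locations/us-central1"
    "/tagTemplates/dataprep_column_profile"
)
tag.column = "order_amount"  # the column this profile describes
tag.fields["valid_values"] = datacatalog_v1.TagField(double_value=980)
tag.fields["invalid_values"] = datacatalog_v1.TagField(double_value=15)
tag.fields["empty_values"] = datacatalog_v1.TagField(double_value=5)

client.create_tag(parent=entry.name, tag=tag)
```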
After you have created the tag templates for Dataprep jobs and profile results, proceed to the next step.
Create the Cloud Function that synchronizes Dataprep metadata with Data Catalog
The Cloud Function will create or update Data Catalog tags for the DWH BigQuery tables, based on the Data Catalog Tag Templates you created previously.
To create a Cloud Function from the Google Cloud console, click here.
The trigger type must be “HTTP”, and in this example Python 3.8 is the runtime used.
You can find the Cloud Function Python code that creates or updates the Dataprep Data Catalog tags in this GitHub repository.
Your Cloud Function needs these five files (a condensed sketch of how they fit together follows the list):
- main.py
- config.py, where you need to update your Google Cloud project name (where the Tag Templates are created) and your Dataprep access token (to use the Dataprep API). You can also update the two tag template IDs if needed.
- datacatalog_functions.py to get or update Data Catalog objects.
- dataprep_metadata.py to get Cloud Dataprep metadata.
- requirements.txt
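To give an idea of how these pieces work together, here is a condensed, illustrative sketch of the HTTP entry point. The Dataprep endpoint, config variable names, tag field, and hard-coded output table are placeholders; the actual repository splits this logic across main.py, dataprep_metadata.py, and datacatalog_functions.py, and also handles updating tags that already exist.

```python
# Condensed, illustrative sketch of the HTTP entry point (main.py).
# Variable names, the Dataprep call, and the table are placeholders.
import json

import requests
from google.cloud import datacatalog_v1

# config.py holds your project ID, Dataprep access token, and tag template IDs
# (variable names here are illustrative).
from config import DATAPREP_TOKEN, PROJECT_ID, JOB_TAG_TEMPLATE_ID

datacatalog = datacatalog_v1.DataCatalogClient()


def process_dataprep_job(request):
    """Cloud Function entry point, called by the Dataprep webhook."""
    payload = request.get_json(silent=True) or {}
    job_id = payload.get("job_id")

    # Retrieve job metadata (and, via related endpoints, profiling results)
    # through the Dataprep API; see dataprep_metadata.py for the exact calls.
    resp = requests.get(
        f"https://api.clouddataprep.com/v4/jobGroups/{job_id}",
        headers={"Authorization": f"Bearer {DATAPREP_TOKEN}"},
    )
    job = resp.json()

    # Locate the Data Catalog entry of the BigQuery table the job loaded
    # (shown here with an illustrative dataset and table).
    entry = datacatalog.lookup_entry(
        request={
            "linked_resource": (
                f"//bigquery.googleapis.com/projects/{PROJECT_ID}"
                "/datasets/sales_dwh/tables/orders"
            )
        }
    )

    # Attach the Dataprep job metadata tag to that entry (the real code also
    # updates the tag when it already exists).
    tag = datacatalog_v1.Tag()
    tag.template = (
        f"projects/{PROJECT_ID}/locations/us-central1"
        f"/tagTemplates/{JOB_TAG_TEMPLATE_ID}"
    )
    tag.fields["job_name"] = datacatalog_v1.TagField(string_value=job.get("name", ""))
    datacatalog.create_tag(parent=entry.name, tag=tag)

    return json.dumps({"status": "ok"}), 200
```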
After you have created and deployed the Cloud Function, you can follow the steps below to call it at the end of the Sales DWH Dataprep pipeline.
Dataprep Flow Configuration
In the Dataprep flow that loads the DWH BigQuery tables, you need to create a Dataprep webhook that calls the Data Catalog Cloud Function we just added. Then, each time a Dataprep job runs successfully, the Data Catalog tags for the DWH BigQuery tables are updated (or created on the first run) with Dataprep job and profiling information.
The webhook URL is your Cloud Function endpoint, and in the POST body you need to pass the Dataprep job ID with {"job_id": "$jobId"}.
Testing the end-to-end process
To test the end-to-end process, simply run a Dataprep job in your Sales flow. If the job is successful, the Data Catalog tags will be created (or updated) on the BigQuery table used in the Dataprep output.
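If you want to exercise the Cloud Function without running a job first, you can simulate the webhook call yourself. The function URL, function name, and job ID below are placeholders to replace with your own values.

```python
# Manual test of the Cloud Function, simulating the Dataprep webhook call.
# URL and job ID are placeholders.
import requests

FUNCTION_URL = (
    "https://us-central1-your-project-id.cloudfunctions.net/process_dataprep_job"
)

resp = requests.post(FUNCTION_URL, json={"job_id": "1234567"})
print(resp.status_code, resp.text)
```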
Data Catalog tag results can be found in the GCP console at https://console.cloud.google.com/datacatalog.
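You can also verify the result programmatically by listing the tags attached to the table's Data Catalog entry; the project, dataset, and table names below are placeholders.

```python
# Optional check: list the tags attached to the BigQuery table's entry.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/your-project-id"
            "/datasets/sales_dwh/tables/orders"
        )
    }
)
for tag in client.list_tags(parent=entry.name):
    print(tag.template, dict(tag.fields))
```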
Conclusion
By following this step-by-step guide, you can now update Data Catalog tags on BigQuery tables based on Dataprep job metadata and profiling results.
You have learned some advanced Cloud Dataprep techniques along the way:
- Calling a Cloud Function from a Dataprep webhook at the end of a flow
- Retrieving job metadata and profiling results through the Dataprep API
- Creating and updating Data Catalog tag templates and tags programmatically
You can extend the solution to add more information to Data Catalog tags or to create tags on other Google Cloud Dataprep objects.
You are now ready to automate your Data Catalog tags from all Dataprep jobs and profiling information.
All source code can be found in the GitHub repository, which explains how to update Google Cloud Data Catalog tags on BigQuery tables with Dataprep metadata and column profiles via a Python Cloud Function.
To learn more about Google Cloud Dataprep by Trifacta, check out our solutions page.