If you manage a data and analytics pipeline in Google Cloud, you may want to monitor it and obtain a comprehensive view of the end-to-end analytics process in order to react quickly when something breaks.
This article shows how to capture Cloud Dataprep job statuses via its APIs, leveraging a Cloud Function, and write them to a Google Sheet for an easy way to check the health of your jobs. Using the same principle, you can combine the statuses of other Google Cloud services in Google Sheets to obtain a comprehensive view of your data pipeline.
To illustrate this concept, we will assume you want to monitor a daily scheduled Dataprep job with a quick look at a Google Sheet to spot potential failures. The icing on the cake is that you will also be able to check the recipe name and the job profile results from Google Sheets.
This article is a step-by-step guide to triggering a Cloud Function when a Cloud Dataprep job finishes and publishing the job results, status, and direct links into a Google Sheet.
Here is an example of a Google Sheet with job results and links published.
1. Getting Started
To make this guide practical, we are sharing the Node.js code for the Cloud Function here on GitHub.
You need a valid Google account and access to Cloud Dataprep and Cloud Functions to try it out. You can activate these services from the Google Cloud Console: https://console.cloud.google.com/.
REMARK: Calling the APIs requires an Access Token, which only a Google Cloud project owner can generate. If you are not a Google Cloud project owner, you can still try it out by using a personal Gmail account.
2. Create the HTTP Cloud Function to Publish in a Google Sheet
First, we need to create the HTTP Cloud Function that will be triggered as a Webhook when a Dataprep job has finished.
Create a Cloud Function from the Google Cloud Console here. The trigger type must be “HTTP”. Give it a name, and you will get a URL similar to https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function. We will need this URL later, when creating the Webhook in Dataprep. In our example, we use Node.js as the Runtime and paste the code shared above under the Source Code section.
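If you want to confirm the trigger is wired up before pasting the full code, a minimal sketch like the one below is enough to echo back what Dataprep sends (the function name and entry point are illustrative, not taken from the shared repository):

```javascript
// Minimal HTTP Cloud Function to verify the trigger URL responds.
// Cloud Functions parses JSON bodies automatically when the request
// carries a Content-Type: application/json header.
exports.DataprepWebhookFunction = (req, res) => {
  res.status(200).send(`Received job ${req.body.jobid} with status ${req.body.jobstatus}`);
};
```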
If you want to explore more about Cloud Functions, check out this tutorial.
The Cloud Function code follows this logic (a minimal sketch follows the list):
- Retrieve the job id and status (Failed or Completed) provided by Dataprep.
- Use the Dataprep user's Access Token in the API calls so that they are authenticated by Cloud Dataprep.
- Get more information (status, recipe id) about the job with the getJobGroup Dataprep API call. Documentation on this Dataprep API endpoint can be found here: https://clouddataprep.com/documentation/api/#operation/getJobGroup
- Get information (name, description) about the job’s recipe with getWrangledDataset Dataprep API call. Documentation on this Dataprep API endpoint can be found here: https://clouddataprep.com/documentation/api/#operation/getWrangledDataset
- Publish the information and links to a Google Sheet. The job result page URL and a link to download the result profile as a PDF are written to the Google Sheet:
- Job result URL is https://clouddataprep.com/jobs/<jobID>
- Job result profile in a PDF format can be downloaded from this URL: https://clouddataprep.com/v4/jobGroups/<jobID>/pdfResults.
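Before opening the full code, here is a minimal sketch of that logic. Treat it as illustrative only: the api.clouddataprep.com base URL follows the API documentation linked above, but the wrangledDataset.id response field, the sheet layout, and the function name are assumptions; the complete, tested code is in the GitHub repository shared earlier.

```javascript
const {google} = require('googleapis');
const request = require('request');

// Placeholder values to replace with your own (see the list below).
var DataprepToken = 'YOUR_DATAPREP_ACCESS_TOKEN';
const JobSheetId = 'YOUR_GOOGLE_SHEET_ID';
const SheetsAPIKey = 'YOUR_GOOGLE_API_KEY';

exports.DataprepWebhookFunction = (req, res) => {
  // Job id and status posted by the Dataprep Webhook.
  const jobID = req.body.jobid;
  const jobStatus = req.body.jobstatus;

  // getJobGroup: fetch the job details, authenticated with the Access Token.
  request.get({
    url: `https://api.clouddataprep.com/v4/jobGroups/${jobID}`,
    headers: {Authorization: `Bearer ${DataprepToken}`},
    json: true
  }, (err, _resp, job) => {
    if (err) return res.status(500).send(err.toString());

    // getWrangledDataset: fetch the recipe name and description
    // (assumes the job group response references wrangledDataset.id).
    request.get({
      url: `https://api.clouddataprep.com/v4/wrangledDatasets/${job.wrangledDataset.id}`,
      headers: {Authorization: `Bearer ${DataprepToken}`},
      json: true
    }, (err2, _resp2, recipe) => {
      if (err2) return res.status(500).send(err2.toString());

      // Append one row per job to the Google Sheet, including the
      // job result page link and the PDF profile download link.
      const sheetsAPI = google.sheets('v4');
      sheetsAPI.spreadsheets.values.append({
        key: SheetsAPIKey,
        spreadsheetId: JobSheetId,
        range: 'A1',
        valueInputOption: 'USER_ENTERED',
        resource: {values: [[
          jobID,
          jobStatus,
          recipe.name,
          `https://clouddataprep.com/jobs/${jobID}`,
          `https://clouddataprep.com/v4/jobGroups/${jobID}/pdfResults`
        ]]}
      }, (err3) => {
        if (err3) return res.status(500).send(err3.toString());
        res.status(200).send('Job status published to the Google Sheet');
      });
    });
  });
};
```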
The full Node.js code is here. You need to edit it and replace the following placeholder values with the ones you retrieved from your own Cloud Dataprep project:
- Access Token to call the Dataprep API:
- var DataprepToken = "eyJhdjkghdjkghjkghjkdfsghk";
- Google Sheet ID where you want to publish the results:
- const JobSheetId = "1X63lFIfsdfdsfsdfdsfN0wm3SKx-Ro";
To retrieve the Google Spreadsheet ID (the long token between /d/ and /edit in the sheet's URL), follow the explanations here.
- Google API Key:
- sheetsAPI.spreadsheets.values.append({
- key: "AIzaSyAN7szfsdfsfsdfLh0qu8qlXUA",
To retrieve the Google API Key, follow the explanations here.
You also need to add the following dependencies to your Node.js Cloud Function (package.json tab):
{
  "dependencies": {
    "googleapis": "^42",
    "request": "^2.88.0"
  }
}
You then need to deploy the Cloud Function. Once deployed, the Cloud Function is running and waiting to be called by Cloud Dataprep when a job is executed. You can learn more about deploying and executing Cloud Functions here.
3. Create a Cloud Dataprep Flow and Configure a Webhook
Next, you need to create the Cloud Dataprep flow that will call the HTTP Cloud Function to publish the job result in Google Sheets.
You need to create and configure a Webhook task in your flow that will call your HTTP Cloud Function.
The Webhook task needs to be configured with this information:
- URL: This is the URL of the HTTP Cloud Function you previously created. For example, https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function.
- Headers: Set the Content-Type header to application/json, as shown in the screenshot below.
- Body: Use the value {"jobid":"$jobId","jobstatus":"$jobStatus"} as shown in the screenshot below.
- Trigger event: You can decide to trigger the Webhook for any status, or only for failed or completed jobs.
- Trigger object: You can decide to trigger the Webhook for only specific outputs in the flow, or for any job executed in the flow.
When you have entered this information, you can test your Webhook task that calls your Cloud Function.
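For example, assuming Node.js 18+ for the built-in fetch, you can also simulate from your own machine the POST that Dataprep will send, using a made-up job id:

```javascript
// Hypothetical test: mimic the Webhook call Dataprep makes when a job ends.
// Replace the URL with your own Cloud Function trigger URL.
const url = 'https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function';

fetch(url, {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({jobid: '12345', jobstatus: 'Completed'})
})
  .then((r) => r.text())
  .then((t) => console.log('Function replied:', t))
  .catch((e) => console.error(e));
```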
After you save the Webhook task, it is then ready to be called when the job is executed.
4. Testing the End-to-End Process
You are now ready to test the end-to-end process by running a job from your Dataprep flow and watching the job result status get added to your Google Sheet.
Lastly, you can also check the execution details (the API calls with their parameters and the Cloud Dataprep job status) by reviewing the Cloud Functions logs, located here.
Conclusion
You should now understand the fundamental principles of automatically publishing Dataprep job results in a Google Sheet, so you can easily monitor and share summary information with a broader team.
You have learned about:
- Cloud Dataprep APIs
- Cloud Dataprep Webhooks
- Cloud Functions calling an API
You can also extend this solution to monitor additional Google Cloud services for end-to-end data pipeline monitoring.
You’re now ready to automate the monitoring of your job status. You can also automate Cloud Dataprep leveraging another Cloud Function or an external scheduler. Take a look at these articles explaining how to orchestrate Cloud Dataprep jobs using Cloud Composer and how to automate a Cloud Dataprep pipeline when a file arrives in Cloud Storage.