Cloud Dataprep Improvements: Macros, Transform by Example, and Cluster Clean

Technology   |   Bertrand Cariou   |   Sep 11, 2019 TIME TO READ: 5 MINS

The Cloud Dataprep September ‘19 release brings three new features: Macros, Transform by Example, and Cluster Clean enhancements. Macros provide a repeatable way to accomplish repetitive or common tasks by combining multiple data-wrangling steps into one action. With Transform by Example, Cloud Dataprep users enter an example of how they would like their data to look, and Cloud Dataprep uses machine learning behind the scenes to create the recipe steps that produce that output. Finally, the Cluster Clean enhancements enable automated standardization, which helps users normalize a list of values more precisely.

All these innovations are a move toward making the data preparation process more intuitive, as they require no prior knowledge of syntax or step creation.

Let’s review all these cool new features that you can use immediately in Google Cloud Dataprep.

Macros

A macro is a set of steps grouped together and presented to users as one task within Cloud Dataprep. In the example shown below, we use three steps to create a macro that flags outliers. Here are the steps bundled into the macro:

  1. Create a column of the standard deviations,
  2. Create a column of the means, and
  3. Create a formula to flag outliers based on whether or not the value falls more than 3.5 standard deviations from the mean.

In the video below, we create a macro from these three steps, with the original column as a parameter that can be changed from recipe to recipe. Rather than creating these three steps from scratch, or locating them in a separate recipe and copying them into the current one, we can pick the macro from our macro library directly on the transformer page, which cuts down on busywork.
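Outside of Cloud Dataprep, the three steps bundled into this macro can be sketched as one Python function with the column name as a parameter. This is only an illustration of the logic, not what a macro produces internally: the function name `flag_outliers`, the output column names, and the use of the sample standard deviation are all assumptions.

```python
from statistics import mean, stdev


def flag_outliers(rows, column, threshold=3.5):
    """Return copies of the rows with std, mean, and outlier-flag columns added.

    `rows` is a list of dicts; `column` plays the role of the macro parameter.
    """
    values = [row[column] for row in rows]
    col_std = stdev(values)   # step 1: standard deviation (sample stdev is an assumption)
    col_mean = mean(values)   # step 2: mean
    flagged = []
    for row in rows:
        distance = abs(row[column] - col_mean)
        flagged.append({
            **row,
            f"{column}_std": col_std,
            f"{column}_mean": col_mean,
            # step 3: flag values more than `threshold` standard deviations from the mean
            f"{column}_is_outlier": col_std > 0 and distance > threshold * col_std,
        })
    return flagged
```

Calling `flag_outliers(rows, "amount")` and then `flag_outliers(rows, "duration")` mirrors reusing the same macro with a different column parameter.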

If needed, you can inspect a macro to see the underlying steps and verify the correct behavior. You can also parameterize more than just columns, including numbers, strings, patterns, booleans, and more, to really customize a macro. If you need to modify any step in a macro to tweak it slightly, you can also convert the macro back to the original set of discrete steps and then modify them.

Reusing a macro is as easy as selecting it and entering the required parameters.

Transform by Example

With Transform by Example, for any existing column, you can enter the desired output value for that column, and Trifacta will build the steps in the background to get you there. We’ll show a couple of different scenarios where this might be useful, but there are many more scenarios where this feature will be an extremely powerful part of your toolkit.

One of the most common tasks that analysts need to perform is pattern reformatting – converting multiple formats of data into a single format by manipulating delimiters, tokens, and word lengths, while preserving semantic content. For example, suppose you have a column of phone numbers that you’d like to reformat into the common +1 ### ### #### US format.

Visualizing phone number data

Writing out data transformations to solve this task can be a time-consuming and error-prone process, especially because the data may contain many different phone number formats, as shown above in the Patterns interface. With Transform by Example, rather than authoring transformations, you instead type out one or more examples of what you’d like your output records to look like, and Trifacta will create the transformation to get you there.

Typing out an example

After entering the example on the first row, Trifacta infers exactly the kind of transformation you’re trying to do. It applies this transformation to your input column and provides you with a preview of what your data will look like once committed. If you’re not satisfied with what Trifacta predicts, you can simply add more examples for different input records until you’re happy with the results. Finally, you can add the transformation as a step to your recipe.
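For comparison, here is a minimal sketch of the kind of hand-written transformation that Transform by Example saves you from authoring. It assumes the inputs are US numbers with 10 digits, optionally preceded by the country code 1. The function name `to_us_phone` is hypothetical, and this is not the recipe Cloud Dataprep generates.

```python
import re


def to_us_phone(raw):
    """Reformat a US phone number as '+1 ### ### ####', or return None."""
    digits = re.sub(r"\D", "", raw)  # keep digits only, dropping delimiters
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return None  # unrecognized format: leave for manual review
    return f"+1 {digits[:3]} {digits[3:6]} {digits[6:]}"
```

Even this short version has to anticipate every input format up front, which is the work the example-driven approach takes off your hands.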

Let’s take a look at another example. The event_time column above is not in the format we need for our downstream analysis. Additionally, there’s a data quality issue with having multiple different formats present. We can easily tackle both of these issues using Transform by Example. Cloud Dataprep allows you to verify that each distinct pattern present in a column is resolved as intended by looking at each pattern present, as shown below.
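The same idea of resolving several input patterns into one target format can be sketched by hand in Python. The candidate formats and the ISO-style target below are assumptions chosen for illustration, since the actual formats in the event_time column are not listed here.

```python
from datetime import datetime

# Hypothetical input patterns; the real column may contain different ones.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%Y%m%dT%H%M%S",
]


def normalize_event_time(raw, target="%Y-%m-%dT%H:%M:%S"):
    """Parse `raw` with the first matching format and re-emit it in `target`."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime(target)
        except ValueError:
            continue  # not this pattern, try the next one
    return None  # no pattern matched: surface the value for review
```

Values that come back as `None` correspond to patterns that are not yet resolved, which is the same check the pattern-by-pattern review performs.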

Cluster Clean Enhancements

This release also comes with enhancements to Cluster Clean, allowing for auto standardization. Auto standardization will standardize the values for which an algorithm can determine a clear primary value in a cluster.
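To illustrate what "a clear primary value in a cluster" can mean, here is a rough Python sketch. It assumes a simple fingerprint-based clustering (lowercase, punctuation removed, tokens sorted) and treats the most frequent spelling as primary only when it is strictly more common than the runner-up. Both choices are assumptions for illustration and are not a description of the algorithm Cloud Dataprep uses.

```python
import re
from collections import Counter, defaultdict


def fingerprint(value):
    """Cluster key: lowercase, punctuation stripped, unique tokens sorted."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def auto_standardize(values):
    """Replace each value with its cluster's primary value when one is clear."""
    clusters = defaultdict(Counter)
    for value in values:
        clusters[fingerprint(value)][value] += 1

    mapping = {}
    for counts in clusters.values():
        ranked = counts.most_common(2)
        # A primary is "clear" if the cluster has one spelling, or one strict winner.
        has_clear_primary = len(ranked) == 1 or ranked[0][1] > ranked[1][1]
        for value in counts:
            mapping[value] = ranked[0][0] if has_clear_primary else value
    return [mapping[value] for value in values]
```

In this sketch, clusters whose spellings are tied are left untouched, so ambiguous cases stay available for manual review.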

For more details, see the full release notes.
