Alteryx Analytics Cloud with Databricks

People   |   Marc Wellman   |   Mar 6, 2024   |   Time to read: 5 mins

Alteryx Analytics Cloud supports an insight-driven organization with unified solutions spanning automated data preparation, approachable machine learning, and AI-generated insights.

Databricks unifies data warehousing and data lakes on a single platform with its Lakehouse architecture, while Alteryx enables self-service analytics with an easy-to-use experience. Together, Alteryx and Databricks empower everyone in your organization to get more value from data by making it faster and easier to access and process.

This blog examines the integration between Alteryx Analytics Cloud and Databricks on AWS.

Diagram: A modern enterprise data platform integrating Alteryx, AWS, and Databricks. Databases, structured, semi/unstructured, and streaming data sources feed a Delta Lake data lake; Databricks handles Bronze/Silver/Gold ingestion, streaming, and Spark-based machine learning; Alteryx provides Designer transformation, location intelligence, Machine Learning, and Auto Insights in the Analytics Cloud Platform; results flow to Tableau for reporting and business outcomes such as predicting marketing campaign success, determining customer lifetime value, forecasting demand to optimize inventory, and automating year-end financial reconciliation.

Databricks complements Alteryx Analytics Cloud from two perspectives:

Connectivity

Databricks offers managed tables that can be accessed much like cloud data warehouse (CDW) and relational sources such as Snowflake and Postgres. It follows the Lakehouse architecture, storing underlying data in open formats such as Parquet in customer-controlled cloud storage while also providing CDW-like capabilities such as ACID support, schema enforcement, and centralized governance. Alteryx Analytics Cloud enables users to easily connect to Unity Catalog and leverage data in Databricks for transformation and analytics.
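
For readers who want to see what this connectivity looks like outside the Alteryx UI, here is a minimal sketch of querying a Unity Catalog table with the Databricks SQL connector for Python. The hostname, HTTP path, and token are placeholders, and the catalog, schema, and table names simply mirror the examples shown later in this post.

from databricks import sql

# Minimal sketch (not part of the Alteryx product): connect to a Databricks SQL
# warehouse and query a Unity Catalog table using its three-level name.
with sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",   # hypothetical AWS workspace
    http_path="/sql/1.0/warehouses/abcdef1234567890",           # hypothetical SQL warehouse
    access_token="dapi-REDACTED",                                # personal access token (PAT)
) as conn, conn.cursor() as cursor:
    cursor.execute(
        "SELECT * FROM salesconnector_catalog.salesconnectors_schema.customer LIMIT 10"
    )
    for row in cursor.fetchall():
        print(row)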

Execution

Databricks provides a managed offering of Apache Spark for processing data at scale. More recently, Databricks released Databricks SQL, which delivers better performance than Spark for SQL workloads. Alteryx Designer Cloud enables analysts to leverage both Databricks Spark and Databricks SQL for workflow execution using a no-code experience.

Connecting to Databricks

The Alteryx Analytics Cloud Administrator adds one or more Databricks workspaces to an Analytics Cloud workspace by specifying the Databricks service URL and a Personal Access Token (PAT).

Admins and users with permission to create connections can then easily connect to these Databricks workspaces by using the Create Connection feature and providing the connection properties.

Screenshot: The Create Connection dialog for Databricks, with fields for Connection Name, an optional Connection Description, Workspace Name (chosen from a dropdown; only one connection can be created per Databricks workspace), the Personal Access Token generated in Databricks, optional Connect String Options, and a Test Connection button.
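
Most users generate the PAT from the Databricks user settings page, but it can also be created programmatically with the workspace Token REST API. The sketch below is a hedged example; the workspace URL, the existing credential used to call the API, and the token lifetime are all placeholder assumptions.

import requests

# Hedged sketch: create a PAT via the workspace Token API (POST /api/2.0/token/create).
# The URL and existing credential below are placeholders, not real values.
WORKSPACE_URL = "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com"
EXISTING_TOKEN = "dapi-EXISTING"  # a credential already authorized to call the API

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
    json={"comment": "Alteryx Analytics Cloud connection", "lifetime_seconds": 90 * 24 * 3600},
)
resp.raise_for_status()
pat = resp.json()["token_value"]  # supply this value in the Alteryx connection dialog
print("PAT created; store it securely:", pat[:8] + "...")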

Importing References to Databricks Tables

With a Databricks connection defined, the Alteryx Analytics Cloud Data Import capability is used to display and explore tables within the Databricks Unity Catalog. Selecting a table displays its metadata and a small data sample.

Screenshot: Browsing the Demo_Databricks connection. The left panel lists schemas such as hr, information_schema, nyctaxi, result, retail, salesconnectors_schema, and tpch; the right panel lists tables in salesconnectors_schema such as customer, lineitem, nation, orders, part, partsupp, region, and supplier, with a preview of the customer table's columns (c_custkey, c_name, c_address, c_nationkey) and sample rows.
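
The same catalog, schema, and table metadata can also be browsed with plain SQL against a Databricks SQL warehouse. The sketch below reuses the hypothetical connection details from the earlier snippet and the object names from the screenshot above.

from databricks import sql

# Hedged sketch: enumerate Unity Catalog schemas and tables, then inspect one table.
with sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",   # hypothetical AWS workspace
    http_path="/sql/1.0/warehouses/abcdef1234567890",           # hypothetical SQL warehouse
    access_token="dapi-REDACTED",                                # PAT
) as conn, conn.cursor() as cursor:
    cursor.execute("SHOW SCHEMAS IN salesconnector_catalog")
    print(cursor.fetchall())

    cursor.execute("SHOW TABLES IN salesconnector_catalog.salesconnectors_schema")
    print(cursor.fetchall())

    cursor.execute("DESCRIBE TABLE salesconnector_catalog.salesconnectors_schema.customer")
    print(cursor.fetchall())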

Data Preparation with Designer Cloud and Databricks

Alteryx Designer Cloud gives business users data preparation capabilities in a visual, interactive, AI-powered platform, ensuring clean, connected, and trusted data is available to support data services, modern BI/reporting, and AI/ML initiatives.

Once a connection to a Databricks workspace is added, a Designer Cloud workflow can reference Databricks tables for input and output. Additionally, once the admin enables the Databricks runtime, Databricks Spark or SQL can be used as the execution engine for workflow processing.

Screenshot: A Designer Cloud workflow with Databricks inputs. CustomerDemographic and CustomerCredit data pass through change-date-type and data-cleansing steps and an Age < 65 filter, converge on an AvgSpendPerTrans calculation, and write to a Databricks SQL output using pushdown; a Databricks Runtime option is selectable under Run Job.

Pushdown Processing into Databricks Serverless SQL

When all workflow inputs and outputs reside in Databricks, Designer Cloud translates the transformation logic into native Databricks SQL, which is pushed down and executed in Databricks. This is the most performant execution method for this scenario: no data egresses from Databricks, and Databricks SQL Serverless compute is used for workflow execution.
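
To make the idea concrete, here is an illustrative sketch of the kind of SQL a pushdown run could produce for the workflow shown above: filter customers under 65, join demographics to credit data, and compute the average spend per transaction, all inside Databricks. This is not the SQL Alteryx actually generates, and the catalog, schema, table, and column names are assumptions.

from databricks import sql

# Illustrative only: a pushdown-style transformation expressed as Databricks SQL and
# executed on a (hypothetical) serverless SQL warehouse, so no data leaves Databricks.
PUSHDOWN_SQL = """
CREATE OR REPLACE TABLE demo_catalog.demo_schema.customer_avg_spend AS
SELECT d.customer_id,
       AVG(c.spend_amount / c.transaction_count) AS avg_spend_per_trans
FROM demo_catalog.demo_schema.customer_demographic AS d
JOIN demo_catalog.demo_schema.customer_credit AS c
  ON d.customer_id = c.customer_id
WHERE d.age < 65
GROUP BY d.customer_id
"""

with sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",   # hypothetical
    http_path="/sql/1.0/warehouses/abcdef1234567890",           # hypothetical serverless SQL warehouse
    access_token="dapi-REDACTED",
) as conn, conn.cursor() as cursor:
    cursor.execute(PUSHDOWN_SQL)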

Databricks Spark Processing

Workflows with any Analytics Cloud-supported input and output can be processed with Databricks Spark as a non-pushdown engine option. Databricks Spark is suitable for larger data volumes that benefit from distributed processing on a compute cluster. A Spark cluster is created temporarily for each job.
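
For comparison, a rough PySpark sketch of the same kind of transformation running on a Databricks job cluster is shown below; the S3 path, table names, and columns are assumptions, and on Databricks the spark session is provided by the runtime.

from pyspark.sql import SparkSession, functions as F

# On Databricks the session already exists; getOrCreate() simply returns it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical non-Databricks input: a CSV landed in S3.
credit = spark.read.option("header", True).option("inferSchema", True).csv(
    "s3://example-bucket/customer_credit/"
)

# Hypothetical Unity Catalog table used as the second input.
demographic = spark.table("demo_catalog.demo_schema.customer_demographic")

result = (
    demographic.filter(F.col("age") < 65)
    .join(credit, "customer_id")
    .groupBy("customer_id")
    .agg(F.avg(F.col("spend_amount") / F.col("transaction_count")).alias("avg_spend_per_trans"))
)

# Write the output back to Databricks as a Delta table.
result.write.format("delta").mode("overwrite").saveAsTable("demo_catalog.demo_schema.customer_avg_spend")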

Monitoring Designer Cloud Jobs Run in Databricks

The execution stages section of the Alteryx Analytics Cloud job details page displays “Databricks SQL” or “Spark (Databricks)” as the Transform Environment for Designer Cloud jobs that run in Databricks. Job status, runtime duration, and output data are also provided on this page.

"A screenshot showing two execution stages of a data process. The first stage is labeled 'Transform', which was completed on 10/19/2023, started on the same date, and ran for 1 minute. It specifies the environment as 'Databricks SQL'. Below it, the second stage is labeled 'Publish', also completed on 10/19/2023, started on the same date, and ran for less than 1 second. There is an activity log below this stage, showing 'customer_demographic_databricks' with a green checkmark, indicating that the process has been completed.

Additional Databricks-specific job information is displayed within the Databricks UI Job Runs page. Workflow jobs executed with Databricks SQL include “SQL” within the job name. Jobs without “SQL” in the name were run with Databricks Spark. Selecting the job name link displays the job run details page that includes query information for SQL jobs and links to logs for Spark jobs.

Screenshot: The Databricks Job Runs history listing three runs of 'ABC-Transform-job' on Oct 19, 2023, each launched manually and run as a Solutions Architect, with durations of 4 min 6 sec, 1 hr 17 min, and 4 min 4 sec, all with status Succeeded.
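
The same run history can be pulled programmatically from the Databricks Jobs API, for example to flag which Designer Cloud jobs ran as Databricks SQL versus Spark. The sketch below is a hedged example; the workspace URL and token are placeholders.

import requests

# Hedged sketch: list recent job runs (GET /api/2.1/jobs/runs/list) and infer the
# engine from the run name, following the naming convention described above.
WORKSPACE_URL = "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com"   # hypothetical
TOKEN = "dapi-REDACTED"                                             # PAT

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"limit": 25},
)
resp.raise_for_status()

for run in resp.json().get("runs", []):
    engine = "Databricks SQL" if "SQL" in run.get("run_name", "") else "Databricks Spark"
    print(run.get("run_name"), engine, run.get("state", {}).get("result_state"))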

Summary

Alteryx Analytics Cloud and Databricks work together to accelerate the path from source data to business insights in the cloud. Business analysts can leverage cloud-native processing on data at scale through a simple no-code platform.

You can learn more about Databricks and its integration with Alteryx Analytics Cloud using the resources below:

Databricks
Alteryx Analytics Cloud Help – Databricks Connections