What is Azure Data Lake and How to Refine It with a Modern Data Wrangling Solution

Technology   |   Paul Warburg   |   Dec 31, 2019 TIME TO READ: 5 MINS

Microsoft Azure Data Lake is a highly scalable cloud service that allows developers, scientists, business professionals, and other Microsoft customers to gain insight from large, complex data sets in a cost-effective way.

Customers can use the data lakes on Azure to store an unlimited amount of structured, semi-structured or unstructured data from a variety of sources. The service does not impose storage limits on account sizes, file sizes, or the amount of data that can be stored in a data lake. 

With its elastic scalability, cost advantages, and integrated suite of analytics services in the Azure ecosystem, Azure Data Lake gives organizations a foundation for exploring a wide range of analytics use cases against the large volume of diverse data stored in it. However, getting data ready to fuel analytics projects can be extremely time-consuming and costly without a proper data wrangling solution. To ensure that clean, connected and trusted data is always available on Azure, you need a cloud-native, modern data wrangling solution that streamlines tedious data wrangling tasks, enables self-service so everyone in the organization can wrangle data, handles the diverse data on Azure, and wraps these key capabilities in enterprise-class data governance and security.

A modern, intelligent data wrangling solution on Azure should be designed around four critical capabilities: 

  1. Tight integration with the Azure ecosystem

    A modern data wrangling solution on Azure should have native integrations with the rich services Azure has to offer. At a minimum, it should be integrated with Azure Data Lake Storage (Gen1 & Gen2), processing engines such as Azure Databricks, and security offerings such as Azure Active Directory. By building on these mature cloud services, a data wrangling solution can give customers the elastic scalability, high performance, enterprise-grade security, and favorable cost-benefit inherent to the Azure cloud.

     

    In addition to the above, a holistic data wrangling solution on Azure should extend connectivity to other services, including data integration via Azure Data Factory, analytics services such as Azure ML and AI, and container orchestration via Azure Kubernetes Service. This not only enables advanced analytics use cases that require diverse and changing data, but also shortens development time for data prep applications.

  2. Empower everyone with self-service data cleaning, preparation & pipeline orchestration

    One of the primary reasons organizations move to a data lake on Azure is to store large amounts of diverse data (structured, unstructured and everything in between) to support a wide range of analytics use cases. To make data and analytics truly pervasive, it is critical for the organization to provide a data wrangling platform that users with different skill sets can quickly master. While data engineers and developers are traditionally the de facto users of data management tools, a modern data wrangling solution should also empower the business users, analysts and data scientists who understand the context of the data best to wrangle the data themselves to meet their specific analytics needs, rather than waiting for IT to deliver the data to them (which often delays time to insight and leads to less-than-optimal analytics outcomes). For example, tools that offer a simple and intuitive UI, visual and interactive data quality assessment and transformation, and real-time feedback on the impact of every transformation on data within the Azure cloud are highly desirable. An advanced data wrangling solution that automates data transformation steps with machine learning-powered suggestions eliminates the need to code, thereby accelerating the entire data prep process.

  3. Enterprise-class data governance and security

    While empowering business users with an easy-to-use data wrangling experience, the same data wrangling platform should also allow IT to manage data governance and security centrally. For example, by using the same protocols and policies in Azure Active Directory to manage user data access for data wrangling, organizations avoid the effort and cost of configuring and maintaining separate security solutions. A data wrangling solution that integrates tightly with Azure Data Catalog, or with other major data catalog offerings on Azure, will provide data lineage and traceability for data transformation workflows, thereby supporting compliance and auditability for the organization.

  4. Facilitate rapid deployment, workflow orchestration, and collaboration

    As the number of users and analytics use cases grows on top of the Azure Data Lake, a modern data wrangling solution should enable automated job orchestration and support seamless collaboration across all stakeholders. Such a platform needs to foster repeatable processes around job scheduling, publishing, and monitoring while allowing customization such as a user-defined job execution schedule.

    To accelerate time to production, a modern data wrangling solution should provide quick-start offerings such as ARM templates, as well as easy deployment options through Azure Marketplace and a fully managed SaaS offering.

    When it comes to job orchestration and collaboration, a modern data wrangling solution needs to support not only basic use cases, such as automatic execution of jobs in sequence with monitoring and alerting, but also advanced features, such as conditional triggers based on the metadata and output of the previous task and custom email notifications that speed up time to resolution.
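
To make the orchestration ideas in item 4 concrete, here is a minimal Python sketch of sequential job execution with a conditional trigger and a failure notification. It is an illustration only and does not represent any Azure or Designer Cloud API: the names `Task`, `TaskResult`, `run_pipeline`, `has_rows`, and the `notify` callback (which stands in for a custom email notification) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TaskResult:
    """Outcome of one task: a status plus free-form metadata about its output."""
    status: str  # "success", "failed", or "skipped"
    metadata: dict = field(default_factory=dict)


def always(prev: Optional[TaskResult]) -> bool:
    """Default trigger condition: run regardless of the previous result."""
    return True


@dataclass
class Task:
    """A hypothetical job step with an optional trigger condition."""
    name: str
    run: Callable[[Optional[TaskResult]], TaskResult]
    condition: Callable[[Optional[TaskResult]], bool] = always


def run_pipeline(tasks, notify):
    """Run tasks in sequence, skipping any task whose condition is not met.

    Each condition is evaluated against the result of the task immediately
    before it. `notify` is called with a message whenever a task fails.
    """
    results = {}
    prev = None
    for task in tasks:
        if not task.condition(prev):
            result = TaskResult("skipped")
        else:
            try:
                result = task.run(prev)
            except Exception as exc:
                result = TaskResult("failed", {"error": str(exc)})
        if result.status == "failed":
            notify(f"Task '{task.name}' failed: {result.metadata.get('error', 'unknown')}")
        results[task.name] = result
        prev = result
    return results


def clean(prev):
    # Pretend the cleaning step filtered out every row.
    return TaskResult("success", {"row_count": 0})


def publish(prev):
    return TaskResult("success", {"published_rows": prev.metadata["row_count"]})


def has_rows(prev):
    # Conditional trigger: publish only if the previous task produced rows.
    return (
        prev is not None
        and prev.status == "success"
        and prev.metadata.get("row_count", 0) > 0
    )


alerts = []  # stand-in for an email outbox
results = run_pipeline(
    [Task("clean", clean), Task("publish", publish, has_rows)],
    alerts.append,
)
print({name: r.status for name, r in results.items()})
```

In this run the `publish` step is skipped because the metadata from `clean` reports zero rows. A real scheduler would add retries, persistence, and time-based triggers; the sketch only shows how a trigger can depend on the previous task's output.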

Modernizing your analytics on Azure is a multi-year marathon, not a sprint. While moving the data to the Azure Data Lake is the first step toward analytics success on Azure, a modern data wrangling solution will help you overcome the biggest obstacle on this journey: getting the data ready quickly so you can jump-start your analytics projects and get ahead of your competition. Great analytics starts with great data, and great data in Azure Data Lake can only be obtained with an intelligent, powerful, secure, yet easy-to-use solution seamlessly integrated into the Azure ecosystem.

To learn more, sign up for a free trial of Designer Cloud, a leading modern data wrangling solution.
