Exploring generative AI: the importance of data and the analytical workflow

Strategy   |   Jawwad Rasheed   |   Jun 25, 2024   |   Time to read: 12 mins

We find ourselves at the forefront of another paradigm shift. The transformative impact that AI has already had on all our lives is staggering, both in its breadth of coverage and in the stealth with which it has integrated into our existence. The emergence and promise of generative AI (genAI) has further fueled the expectations of business leaders, who are optimistic and uncertain in equal measure about what the future holds. A recent study by the IBM Institute for Business Value found that three out of four of the 3,000 global CEOs interviewed believed that competitive advantage will depend on who has the most advanced genAI capabilities.

How will the evolution of AI, specifically genAI, impact the lives of data professionals? And what role will Alteryx play in aiding the transition to the future state, if any? To help answer these questions, let’s start with the fundamentals to make sure we are on the same page.

Who’s who in the data world, and what do they do?

Data engineers are responsible for designing, testing, and implementing data pipelines that organize and integrate data from various sources. Data architects define and execute data integration strategies and maintain the data architecture, monitoring data quality and integrity. Both data engineers and architects are proficient in database management and are typically expert in at least one programming language for codifying data processing.

[Figure: Venn diagram of the roles and responsibilities of business analysts, business leaders, data engineers, data architects, data analysts, and data scientists in relation to data usage and analysis within an organization.]

Data analysts and data scientists have traditionally been the consumers of these data pipelines, alongside business users. Data analysts are proficient at data extraction, manipulation, and investigation to identify insights and trends that may not be immediately apparent. Data scientists may apply more advanced statistical techniques to large datasets and develop models to address complex business requirements. Both data analysts and scientists rely on good communication through effective data visualization to relay their findings and typically have strong coding experience.

[Figure: The traditional, linear workflow for data projects: business analysts and leaders define requirements and pass them to IT; data engineers and architects build the data pipelines; the resulting data is then passed to data analysts and scientists for analysis and insight generation.]

The impact of hyperautomation and low-code AI solutions

Now let's overlay the rise of hyperautomation, digitization, and self-service capabilities over the last two decades, which have leveraged intelligent automation and 'traditional' AI capabilities (as distinct from genAI). The adoption of low-code AI/ML solutions (like Alteryx) that simplify data access, transformation, and analysis for business and IT has had several effects across the enterprise.

First, business analysts can now build the data pipelines and perform the data analysis that historically resided within IT, where business requests often sat in long, misprioritized queues. Alleviating this bottleneck has improved speed-to-market for business leaders and enabled IT data professionals to focus on the more complex requirements that genuinely need their coding and programming expertise. This has, in turn, concentrated IT ownership on developing intricate data pipelines and running advanced data analysis.

Second, these tools have led to the growth of data citizens and promoted data literacy across enterprises by empowering business lines and functions (such as Finance or HR), who unsurprisingly understand their data the best, to drive domain-specific insights.

And third, the tools have helped to bridge some of the gap between business lines and IT by creating dotted management lines between the functions and promoting agile and DevOps practices and culture. I say ‘some of the gap’ as there is still a long way to go for many corporations that are not natively digital.

[Figure: How hyperautomation and self-service tools change business adoption of data capabilities: the traditional IT-centric model shifts toward a more collaborative one, in which business analysts and leaders play a more active role in defining requirements and use low-code AI platforms to simplify data processes, leading to faster time-to-market and more advanced insights.]

What about using genAI for data analysis?

Now consider the rise of genAI. Over the last couple of years, we've seen a huge rise in multimodal large language models (LLMs) that offer data analysis capability, alongside the development of open-source small language models (SLMs) fine-tuned to their areas of focus and operating domain. For example, OpenAI proudly promotes its Data Analyst plug-in on ChatGPT Plus with continuous releases of new features. The basic capability is to understand datasets and complete tasks described in natural language, whereby ChatGPT analyzes uploaded files by writing and running Python code without user intervention. OpenAI also promotes its ability to handle a range of data tasks, including merging and cleaning datasets, data exploration, creating visualizations, and uncovering insights, while maintaining data privacy. In parallel, developers and data enthusiasts alike are creating prompt lists that help execute data and analysis tasks without ever needing to write code or scripts.
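To make that concrete, below is a minimal sketch of the kind of Python such an assistant typically writes and runs when asked, in natural language, to merge, clean, and visualize two uploaded files. The file and column names (orders.csv, customers.csv, customer_id, revenue) are hypothetical placeholders rather than anything tied to a specific product.

```python
# A sketch of the Python a genAI "data analyst" might generate for the prompt:
# "merge these two files, clean them, and plot monthly revenue".
# File and column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

# Load the two uploaded files
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("customers.csv")

# Clean: drop duplicates and rows missing the join key
orders = orders.drop_duplicates().dropna(subset=["customer_id"])
customers = customers.drop_duplicates(subset=["customer_id"])

# Merge the datasets on the shared key
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate: total revenue per month
monthly = merged.set_index("order_date")["revenue"].resample("M").sum()

# Visualize the trend
monthly.plot(kind="bar", title="Monthly revenue")
plt.tight_layout()
plt.show()
```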

These features will be very compelling for any newcomer who wants to harness the power of data analysis without prior coding experience, and they help alleviate the fear of jumping into the data deep end. The advancements will also impress savvier data professionals, though they will raise an eyebrow at the breadth and depth of the data analysis capability on offer.

The data analysis features available in today's genAI models have limitations that may surprise some. For example, the largest file sizes that can be uploaded are typically less than 1 GB, with OpenAI's Data Analyst GPT capping uploads at 512 MB per file and ten files at a time. Hallucinations may still be present in the conclusions drawn, as transformer models are directed to find an answer even where one may not exist. In other instances, an explanation may be provided on how to manually compute a required answer, though the calculated result itself will not be provided due to tool limitations. Visualizations are rarely informative on the first pass and typically require refinement through a succession of follow-up prompts.
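One practical workaround for today's upload caps is to split a large file into chunks that each stay below the limit and analyze them piece by piece. The sketch below assumes a hypothetical large_dataset.csv and mirrors the 512 MB cap mentioned above; it illustrates the general approach rather than a feature of any particular tool.

```python
# A sketch of splitting a large CSV into chunks that each stay below a size cap
# so they can be uploaded one at a time. File name and cap are assumptions.
import os
import pandas as pd

MAX_BYTES = 512 * 1024 * 1024  # stay under a 512 MB per-file cap
SOURCE = "large_dataset.csv"   # hypothetical large input file

# Estimate how many rows fit in one chunk from the average bytes per row
total_bytes = os.path.getsize(SOURCE)
total_rows = sum(1 for _ in open(SOURCE)) - 1  # minus the header row
rows_per_chunk = max(1, int(total_rows * MAX_BYTES / total_bytes * 0.9))  # 10% safety margin

# Stream the file and write bounded chunks
for i, chunk in enumerate(pd.read_csv(SOURCE, chunksize=rows_per_chunk)):
    chunk.to_csv(f"chunk_{i:03d}.csv", index=False)
```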

With that context, I would view the current genAI data analysis capabilities as complementary for non-data professionals who do not need to work with large datasets or run complex analyses. I have assumed these non-data professionals, typically business users, know what they want to query in the first place, which, in practice, is not always the case. Data professionals will continue to use genAI models to curate, refine, and augment code that can only be understood and applied by those familiar with the programming language of choice.

Let’s assume a nirvana state where these existing limitations are removed, and the need for constructing data pipelines and designing analysis workflows is entirely alleviated by directly querying a machine to generate results. Is that a viable state, and what are the implications to get there? What role will Alteryx play if users can query LLMs or fine-tuned models to address their data analysis needs?

The continued relevance of data and analytical workflows

I'll start by making a single statement in response to the questions above. GenAI tools will not remove the need for data and analytical workflows; on the contrary, they reinforce it. Here are eight reasons why:

[Table: Eight reasons why genAI is not a silver-bullet solution (illustrated for the banking industry) and why robust data pipelines remain essential; each reason is expanded in the list below.]

  1. No Silver Bullet: GenAI should not be considered the single answer or enabler for all use cases. Like any other utility, genAI has its role in simplifying our lives alongside other tools. For example, forecasting models must rely on high-quality, trusted data inputs and conventional machine learning algorithms to generate accurate projections, as can be delivered on the Alteryx platform. A genAI model could then be overlaid to explain the forecasts in natural language, but it would not be the tool that generates the forecast itself (see the sketch after this list).
  2. Need for Explainability: Maintaining explainability in reported outcomes will be paramount, given, for example, the scrutiny applied to financial statements or regulatory reports. Unpacking genAI models underpinned by neural networks with trillions of parameters and tokens, just to explain how a piece of information was derived, is not an exercise most corporations would be willing to conduct from a resourcing or ROI perspective.
  3. Need for Repeatability: The demand for consistency and repeatability in outcomes sits alongside explainability. Corporations will strive for model outputs that return the same results whenever all other factors are held constant. Their reliance must therefore be placed on transparent data lineage, covering data sources, transformations, derivations, and calculations, as managed via the Alteryx platform. Mitigating model hallucinations, whereby inaccurate or fictitious information is produced or different answers are returned for the same prompt, remains a challenge. For instance, a genAI model might interpret a temporary market anomaly as a long-term trend, leading to outputs that could misguide decisions.
  4. Model Governance: The difficulty of explaining to users how an AI model produces any given answer creates a significant challenge for governing the models. What information would be needed to review and approve models, and how would the models be assessed? To what extent would model risk governance frameworks and controls need to be redesigned, and is such a redesign even feasible? Will the models produce reliable outcomes that can be reconciled to their design? Knowing that even non-complex regression models can have long lead times for approval, companies may decide this battle is not worth tackling bottom up. Instead, the preference or mandate may be to align with the direction provided by governments and regulators, such as the EU AI Act, which proposes the creation of AI model inventories with risk classifications.
  5. Data Quality Uncertainties: The reliability of AI model outcomes will depend on the quality and integrity of the data used to train the model. Reliance on closed-source LLMs diminishes when tackling domain-specific problems, which is why organizations are increasingly adopting open-source SLMs fine-tuned on domain-specific data, often managed within private cloud instances. Training SLMs will require a high level of certainty about data inputs, delivered via platforms like Alteryx that can manage a network of complex data pipelines. Mitigating algorithmic bias arising from imperfect training data or engineering decisions made during development will remain challenging without a clear data management strategy. Just as a good coach is critical to developing athletes through the most suitable training drills, the quality of an AI model's inference is directly correlated with the quality of its training dataset.
  6. High Training Costs: The costs associated with fine-tuning AI models will raise the eyebrows of CFOs during budgeting cycles. The at-scale deployment of genAI will require either using dedicated hardware or significantly increasing cloud workloads. Should that happen, cloud providers will naturally seek compensation, even if they currently subsidize genAI experiments and development. Training and deploying foundation models may also increase carbon emissions and exceed environmental, social, and governance (ESG) commitments or expectations. Minimizing the total cost of ownership of processing power may be the game-changer, particularly in light of announcements from NVIDIA and their Blackwell GPU architecture that aims to power the new AI revolution. All industries are eagerly awaiting the successful deployment of low-cost AI hyper-scalers.
  7. Intellectual Property and Privacy Concerns: Significant attention is required to assess the potential intellectual property risks of using training data and generating model outputs, including possible infringement of copyrighted, trademarked, patented, or otherwise protected materials. In addition, genAI may heighten privacy concerns through the potentially unintended use of personal or sensitive information in model training. New applications may also be subject to security vulnerabilities and manipulation; for example, users or bad actors may bypass safety filters through obfuscation, payload splitting, and virtualization. Addressing these factors demands resource-intensive, continuous activity and adds legal complexity. Organizations must consider how to integrate data platforms with robust governance that helps alleviate data privacy and security concerns across training data and model inference.
  8. Skills Gap: Our broader understanding of where and how genAI will impact our lives is still evolving. We are experiencing a revolution in which the pace of technological innovation is outstripping the skills needed to manage the change. Academic institutions are catching up by introducing curricula that develop AI skill sets at the grassroots level, though the transition from academia into industry is a slow-drip process. Upskilling existing resources at scale creates a similar challenge, where keeping up with the pace of change may not be manageable. If AI models are to be scaled to complement data processes, more attention must be paid to upskilling data professionals so that they understand the data that trains the models, know how best to prompt them, and can back-test model outputs to provide assurance on the results.
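To illustrate the division of labour described in point 1, here is a minimal sketch in which a conventional regression model generates the forecast from curated inputs, and a genAI layer would only be prompted afterwards to narrate the result. The monthly figures and column names are invented for illustration.

```python
# A sketch of the pattern in point 1: a conventional model produces the
# forecast from trusted inputs; a genAI layer would only explain it afterwards
# in natural language (that explanation step is not shown here).
# The data below is synthetic and the column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical monthly history of a trusted, quality-checked metric
history = pd.DataFrame({
    "month_index": np.arange(24),
    "revenue": 100 + 2.5 * np.arange(24) + np.random.default_rng(0).normal(0, 3, 24),
})

# Fit a conventional model on the curated inputs
model = LinearRegression()
model.fit(history[["month_index"]], history["revenue"])

# Project the next six months
future = pd.DataFrame({"month_index": np.arange(24, 30)})
future["forecast"] = model.predict(future[["month_index"]])
print(future)

# A genAI model could now be prompted with `future` and the fitted coefficients
# to narrate the projection, but it is not the tool that generated the forecast.
```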

In conclusion, the future remains bright for the Alteryx platform, which will play a critical role in the design, development, and execution of data and analytical workflows – the need for which will only amplify with the growth of genAI. The demand for repeatable and explainable results delivered at low cost will intensify.

Alteryx has fully embraced the leap forward with Alteryx AiDIN, the AI engine and personal assistant that infuses traditional AI and genAI across its enterprise-grade platform. For example, users can minimize documentation effort via AiDIN Workflow Summary, which provides automated commentary on the purpose, inputs, outputs, and key logic steps of every workflow a user develops. With AiDIN Copilot, users engage directly with the Alteryx platform via natural language prompts to obtain clear guidance on the appropriate workflow construct, then see the workflows being developed for them. This provides explainability and repeatability in outcomes. And the Alteryx platform can itself be used to invoke LLMs or SLMs, providing a mechanism to integrate their use into production as part of an analytical workflow.

With an exciting roadmap of new capabilities in the wings, Alteryx will continue to simplify the lives of business users by allowing them to develop their own workflows. This, in turn, frees experienced data professionals to manage the complex, high-quality data pipelines critical for the development of AI models. Organizations that invest in enterprise-wide data literacy while exploring new AI applications will be best placed to adapt to the AI revolution that we see unfolding at an exponential pace.
