Excel Errors That Commonly Occur with Big Data

Technology   |   Paul Warburg   |   Jul 1, 2016   |   Time to read: 4 mins

If Excel were a person, it’d be old enough to run for the U.S. Senate. When a piece of software survives for that long, it’s because it offers something useful and valuable to the world. But as data advances to new heights in size and complexity, Excel is beginning to show its age. The simple fact is, Excel’s framework for data processing wasn’t made for big data.

Beyond just slowing down to an intolerable level, using Excel on these bigger and more complex datasets is prone to errors. And when it comes to business-critical data, those errors can translate into huge consequences. Despite its continued usefulness for small datasets, Excel is reaching its breaking point among analysts working at the scale of big data.

In this post, we outline some of the ways that Excel is error-prone, and why your organization might be ready for a new solution.

1. Excel maxes out at a million rows → No holistic view of your data.

Depending on the complexity of the formulas you have applied, it can take tens to hundreds of minutes to refresh an Excel spreadsheet with a few hundred thousand rows. And for datasets beyond Excel’s hard limit of 1,048,576 rows per worksheet, you have to split the data into separate files, which means you can’t see patterns, surface comprehensive insights, or even find mistakes spread across the files.
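For a sense of what working past that cap looks like, here’s a minimal sketch in Python with pandas, assuming a hypothetical sales.csv of several million rows with region and revenue columns; it aggregates the whole dataset in one pass instead of splitting it across workbooks:

```python
import pandas as pd

# Hypothetical input: a multi-million-row CSV that exceeds Excel's
# 1,048,576-row-per-worksheet limit. Reading it in chunks keeps memory
# bounded while still treating it as one dataset.
parts = []
for chunk in pd.read_csv("sales.csv", chunksize=1_000_000):
    parts.append(chunk.groupby("region")["revenue"].sum())

# Combine the per-chunk aggregates into a single holistic view, so
# patterns and mistakes aren't hidden across separate files.
totals = pd.concat(parts).groupby(level=0).sum()
print(totals.sort_values(ascending=False))
```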

Consider this: With Designer Cloud (formerly Trifacta), PepsiCo was able to evaluate its data comprehensively and spotted a mistake that saved the business millions in the process. And to address the slow refresh of a huge sheet, Designer Cloud works with samples of the data, an approach so effective that PepsiCo cut its data preparation time by 70% to 90%.

2. Excel is optimized for single-cell manipulation.

Excel’s framework was designed to manipulate a single cell at a time. These manipulations can be automated or applied broadly using mechanisms like formula extrapolation, macros, and array functions, but you still have to manually apply them to the entire target range of cells. And if you need to improve a complicated formula, Excel offers only global “undo” and “redo” functions; there’s no way to modify an intermediate step.

Excel’s “join” capabilities also don’t scale for big data use. VLOOKUP can only match values against the left-most column of a lookup range, which means complex, multi-keyed joins require a bunch of manual editing first, typically concatenating key columns into a helper column. Ever had to interpret a colleague’s Excel logic? Excel can’t help you learn that logic faster, which means more time and more error potential with every edit. With so much manual entry and manipulation, a single moment of lost concentration can introduce errors and affect results.
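For contrast, here’s a minimal sketch of a multi-keyed join in pandas, using two small hypothetical tables; VLOOKUP would force you to build a concatenated helper column before it could match on both keys:

```python
import pandas as pd

# Hypothetical tables keyed on both cust_id AND region.
orders = pd.DataFrame({
    "cust_id": [1, 1, 2],
    "region":  ["US", "EU", "US"],
    "amount":  [100, 250, 75],
})
rates = pd.DataFrame({
    "cust_id":  [1, 1, 2],
    "region":   ["US", "EU", "US"],
    "discount": [0.10, 0.05, 0.00],
})

# A multi-keyed join in one declarative, reviewable step; no helper
# columns, no cell-by-cell formula editing.
merged = orders.merge(rates, on=["cust_id", "region"], how="left")
print(merged)
```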

Consider this: With Designer Cloud, users can modify and redo any step in the data wrangling process. This enables automation across an entire dataset, encourages repeatability, and reduces the number of mistakes.

3. Excel doesn’t have visual profiling to easily spot errors.

Let’s face it: identifying data errors in Excel is hard. We all love playing “find the flaw,” but at a few hundred thousand rows the game gets annoying, then impossible. Even Excel’s data validation rules, which prompt custom messages, have to be manually created and applied. And while Excel’s quick summaries at the bottom of a column or row (such as sum, max, min, and mean) are great, it can’t do the same trick for non-numeric data.
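As a rough illustration of what profiling non-numeric columns can look like outside Excel, here’s a small pandas sketch with hypothetical state and signup_date columns:

```python
import pandas as pd

df = pd.DataFrame({
    # Note the casing error ("ca") and the missing value (None).
    "state": ["CA", "NY", "CA", None, "ca"],
    "signup_date": ["2016-01-02", "2016-13-01", "2016-02-15",
                    "2016-03-01", "2016-03-09"],
})

# Profile a string column: frequency counts surface typos and missing values.
print(df["state"].value_counts(dropna=False))

# Coerce to datetime; invalid entries ("2016-13-01") become NaT and
# are trivial to count.
parsed = pd.to_datetime(df["signup_date"], errors="coerce")
print(f"{parsed.isna().sum()} unparseable date(s)")
```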

Consider this: Visual profiling is a core strength of Designer Cloud. By visually representing all types of non-numeric data (strings, geographic data, datetimes), it makes it far easier to understand what’s in your data, including missing and inaccurate values.

4. Excel doesn’t encourage collaboration.

Everyone knows two heads are better than one, and that goes for spotting mistakes, too. While an Excel file contains all the relevant data and the formulas that transformed it, you have to manually inspect each cell to understand which formulas were applied and what their inputs were. This means that reviewing another analyst’s work, whether to check it for errors or to reproduce it yourself, is time-consuming and often neglected altogether.
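One way around this, sketched below in pandas with hypothetical column names, is to keep the transformation logic as a named, ordered pipeline that a reviewer can read top to bottom instead of reverse-engineering it cell by cell:

```python
import pandas as pd

def prepare_sales(raw: pd.DataFrame) -> pd.DataFrame:
    # Every step is explicit and reviewable in one place, rather than
    # buried in individual cell formulas.
    return (
        raw
        .dropna(subset=["region"])                          # 1. drop rows with no region
        .assign(region=lambda d: d["region"].str.upper())   # 2. normalize casing
        .query("amount > 0")                                # 3. drop refunds and bad entries
    )
```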

Consider this: For improved collaboration, a more useful representation is to summarize the data and the transformation logic side by side in one central location, as in Designer Cloud’s interface.

The Bottom Line

We love Excel and use it here at Alteryx every day. Like a pair of old slippers, it’s comfortable and easy to use. But when it comes to manipulating, analyzing, and sharing big data, we want something built to support us on that rockier terrain. That’s why we made Designer Cloud.
