Data preparation can be one of the most time-intensive, repetitive tasks during your work week. If you don’t clean, validate, and consolidate your raw data the right way, dirty data will run rampant and your insights won’t be safe.
So how do you make sure your data preparation is keeping dirty data at bay while protecting the data your company depends on?
It takes vigilance and superhero strength.
Grab your cape, shield, or Batarang, and ensure that your data is bulletproof — we’re talking invulnerable to formatting errors, inaccuracies, and outliers — with these six steps to saving the day with an indomitable data preparation strategy.
1. A Bird’s-Eye-View Sharper Than
Hawkeye’s
Before you start working intensely with a new dataset, it’s a good idea to step boldly into the raw material and do a bit of exploring. Genetically modified eyesight can help (like Hawkeye’s), but it’s not required. Start with a mental picture of what you’re looking for, but also keep an open mind and let the data do the talking.
Tips: Data exploration
- Scan column names and field descriptions to see if any anomalies stand out, or any information is missing or incomplete.
- Do a temperature check to see if your variables are healthy: how many unique values do they contain? What are the ranges and modes?
- Spot any unusual data points that may skew your results. You can use visual methods — say, box plots, histograms, or scatter plots — or numerical approaches such as z-scores.
- Scrutinize those outliers. Should you investigate, adjust, omit, or ignore them?
- Examine patterns and relationships for statistical significance.
2. Data More Refined Than Iron Man’s Core
Data full of errors and inconsistencies carries a big price tag: Studies have shown that dirty data can shave millions off a company’s revenue. Because these mistakes can be as expensive as a palladium core, to prevent big losses, you’ll need to clean your data until it glows with a fervent, self-illuminating light.
Tips: Data cleansing
- Ditch all duplicate records that clog your server space and distort your analysis.
- Remove irrelevant rows or columns that won’t affect the problem you’re trying to solve.
- Investigate and possibly remove missing or incomplete info.
- Nip any unwanted outliers you discovered during data exploration.
- Fix structural errors — typography, capitalization, abbreviation, formatting, extra characters.
- Validate that your work is accurate, complete, and consistent, documenting all tools and techniques you used.
3. A Stronger Combination Than the Avengers
The more high-quality sources you incorporate into your analysis, the deeper and richer your insights. Any project you undertake will typically require six or more data sources, requiring data blending tools to fuse them together seamlessly. Basically, you need to assemble the ultimate team of high-quality data.
Tips: Data blending
- Acquire and prep. If you’re using modern data tools rather than trying to make files conform to a spreadsheet, you can include almost any file type or structure that relates to the business problem you’re trying to solve and transform all datasets quickly into a common structure. Think files and documents, cloud platforms, PDFs, text files, RPA bots, and application assets like ERP, CRM, ITSM, and more.
- Blend. In spreadsheets, this is where you flex your VLOOKUP muscles. (They do get tired, though, don’t they?) If you’re using self-service analytics instead, this process is just drag-and-drop.
- Validate. It’s important to review your results for consistency and explore any unmatched records to see if more cleansing or other data prep tasks are in order.
4. Data Sense Is the New Spidey Sense
Data profiling, data exploration’s cousin, requires more scrutiny. It means examining a dataset specifically for its relevance to a particular project or application. You’ll have to use your instincts and know-how to determine whether a dataset should be used at all — a big decision that could have serious financial consequences for your company.
Tips: Data profiling
- Structure profiling. How big is the dataset and what types of data does it contain? Is the formatting consistent, correct, and compatible with its destination?
- Content profiling. What information does the data contain? Are there gaps or errors? This is the stage where you’ll run summary statistics on numerical fields; check for nulls, blanks, and unique values; and look for systemic errors in spelling, abbreviations, or IDs.
- Relationship profiling. Are there spots where data overlaps or is misaligned? What are the connections between units of data? Examples might be formulas that connect cells, or tables that collect information regularly from external sources. Identify and describe all relationships, and make sure you preserve them if you move the data to a new destination.
5. Establish Your Secret Base
With the enormous volume and complexity of data sources available to you, it’s inevitable that you’ll need to extract it, integrate it, and store it in a centralized location that allows you to retrieve it for analysis whenever you need it — kind of like a secret base (Batcave?) for your day-saving data.
Tips: ETL (Extract, Transform, Load)
- Extract. Pull any and all data — structured or unstructured, one source or many — and validate its quality. (Be extra thorough if you’re pulling from legacy systems or external sources.)
- Transform. Do a deep cleanse here, and make sure your formatting matches the technical requirements for your target destination.
- Load. Write the transformed data to its storage location — usually, a data warehouse. Then sample and check for data quality errors.
6. Wrangling Better Than Wonder Woman’s Lasso of Truth
The term “data wrangling” is often used loosely to mean “data preparation,” but it actually refers to the preparation that occurs during the process of analysis and building predictive models. Even if you prepped your data well early on, once you get to analysis, you’ll likely have to wrangle (or munge or lasso) it to make sure your model will consume it — not spit it back out.
Tips: Data wrangling
- Explore. If your model doesn’t perform the way you thought it would, it’s time to dive back into the data to look for the reason.
- Transform. You should structure your data from the beginning with your model in mind. If your dataset’s orientation needs to pivot to provide the output you’re looking for; you’ll need to spend some time manipulating it. (Automated analytics software can do this in one step.)
- Cleanse. Correct errors and remove duplicates.
- Enrich. Add more sources, such as authoritative third-party data.
- Store. Wrangling is hard work. Preserve your processes so they can be reproduced in the future.
And there you have it. Follow these six steps and your data will be faster than a plane, stop bullets in their tracks, and essentially, wear a giant, flowing red cape.
Oh, and if you’re tired of VLOOKUPs and manual processes, you might want to give automation a try. We’re talking analytic superpowers, the ability to access data in any format and automate your current prep processes with an automated, self-service analytics platform.
In our experience, automating data preparation looks like this:
- Save the day — faster: Switching to an automated platform almost always produces a measurable return in a matter of days or weeks.
- Full-time, efficient crime-fighting: Automation completely changes the focus of your workday from manual, repetitive tasks to creative ones. And you’ll never have to solve the same data problem twice.
- Get a sidekick … or four: When you eliminate the need for data gatekeepers, you can engage the entire organization. Employees at all levels begin coming up with new ways to expand their own capabilities.
It’s such a profound change — a different universe, really — that we have a name for it: Analytic Process Automation (APA).
Learn more about APA and how it gives you analytic superpowers.
Ready for More?
READ
Discover how easy it is to switch to a modern-day analytics approach in A Radical Guide to Data Analytics Mastery.
EXPERIENCE
Jumpstart your mastery of all thing data blending with the Alteryx data blending starter kit.