As a customer success manager at Alteryx, I spend most of my time helping our customers wrangle their raw, big data into business insights. On these data-wrangling projects, it’s tempting to jump straight into the most interesting problems, but to produce the most accurate results, we should start by performing a set of basic data quality validations.
We call these initial data quality validations “sanity checks.”
Running sanity checks is an important (and often neglected) part of the analytics process. As with all things analytics- and AI-related, your final analysis is only as accurate as your data, which means it’s more than worth it to take a few minutes to validate your data’s accuracy and completeness. A quick sanity test goes a long way toward an accurate analysis.
At Alteryx, we believe so strongly in sanity checking that we’ve incorporated robust profiling into our own products and functionality. Read on to see how I use Designer Cloud to perform four key sanity checks while wrangling data for customer projects.
1. Take a random sample of the data.
Why this sanity test is useful for data validation
Often, the data sets that I’m working with are too large to easily assess as a complete source. Looking only at a consecutive sample of that data — for example, the first 1,000 rows — can give me an incomplete understanding of the contents of a data set. By generating a random sample over the entirety of the data set, I get a more accurate picture of the full data set.
Alteryx Designer Cloud lets me swap between a sample of the first rows of a data set and a random sample from the complete data set.
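Outside of Designer Cloud, you can sketch the same comparison in a few lines of pandas. The data below is a toy stand-in, not a real log set; the point is that a first-rows sample can miss categories that only appear later in the data.

```python
import pandas as pd

# Toy "large" data set: the first rows look uniform, but later rows
# introduce categories that a head() sample would never reveal.
df = pd.DataFrame({
    "event": ["login"] * 900 + ["login", "purchase", "refund"] * 33 + ["login"],
})

head_sample = df.head(100)                         # first-rows sample
random_sample = df.sample(n=100, random_state=42)  # random sample of the full set

# The first-rows sample sees only one category; the random sample is
# far more likely to surface the others.
print(head_sample["event"].nunique())   # 1
print(random_sample["event"].nunique())
```

Comparing the unique-value counts between the two samples is a cheap first signal that the head of the file doesn’t represent the whole.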
Real-life sanity checking and data wrangling
Last week, I was showing a customer how to wrangle data for a set of machine-generated logs containing product usage information. Since each product sent log data on a regular basis, the data volume was absolutely huge — there were thousands of individual log files, each with a size of around 2 GB. This meant that we wouldn’t be able to easily examine the complete data set at once.
After loading the data into Designer Cloud, we decided to compare the first-rows sample with the random sample to validate that the structure of the logs remained consistent throughout the data set. We found that the number of categories, or unique strings, increased when we switched to a random sample. We dug a bit deeper into our data validation and noticed the following:
One of our columns sometimes contained an array of additional values! To make sure that the data was structured correctly, we needed to extract this array into a new column. Examining a random sample of our data during a sanity test allowed us to surface this data quality issue and then take steps to address it.
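As a rough illustration of that fix (the column names and values here are hypothetical, not from the customer’s logs), a field that sometimes carries an embedded array can be parsed and expanded into its own rows:

```python
import json
import pandas as pd

# Hypothetical log rows: the "details" field sometimes holds a
# JSON-style array of extra values alongside the scalar fields.
logs = pd.DataFrame({
    "device_id": ["a1", "a2", "a3"],
    "details": ['["err_42"]', None, '["err_17", "err_99"]'],
})

# Parse the embedded array (when present) into a proper list column,
# then explode it so each extra value gets its own row.
logs["extra_values"] = logs["details"].apply(
    lambda s: json.loads(s) if isinstance(s, str) else []
)
exploded = logs.explode("extra_values")
print(exploded[["device_id", "extra_values"]])
```

Rows whose `details` field had no array simply come through with an empty value, so nothing is silently dropped.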
2. Check for datatype mismatches, variations in how values are entered, and missing values.
Why this sanity test is useful for data validation
Effective downstream data analysis requires consistency. You can’t easily understand the relationships between events if some of those events have dates formatted as yyyy/MM/dd and others have dates formatted as dd/MM/yyyy. Similarly, some systems may store prices as strings ($1,000) while others store them as decimals (1000.00).
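A minimal pandas sketch of normalizing exactly those two inconsistencies (the column names are illustrative): parse each known date format explicitly rather than guessing, and strip currency formatting before casting prices to numbers.

```python
import pandas as pd

# Mixed-format inputs like the ones described above.
raw = pd.DataFrame({
    "event_date": ["2023/04/05", "05/04/2023"],  # yyyy/MM/dd vs dd/MM/yyyy
    "price": ["$1,000", "1000.00"],              # currency string vs decimal
})

def parse_date(s: str):
    # Try each known format in turn; never let pandas guess.
    for fmt in ("%Y/%m/%d", "%d/%m/%Y"):
        try:
            return pd.to_datetime(s, format=fmt)
        except ValueError:
            continue
    return pd.NaT

raw["event_date"] = raw["event_date"].apply(parse_date)

# Strip the currency symbol and thousands separator, then cast to float.
raw["price"] = raw["price"].str.replace(r"[$,]", "", regex=True).astype(float)
print(raw)
```

Both rows above describe the same date and the same price; after normalization they actually compare equal, which is the whole point of this sanity check.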
How data validation works in Designer Cloud
Designer Cloud automatically profiles your data and provides information about the number of valid, missing, and mismatched values in each column. The results of this profiling are shown in the data quality bar above each column in the data set.
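This isn’t how Designer Cloud computes its quality bar internally, but a hand-rolled analogue of the same valid/missing/mismatched breakdown is easy to sketch in pandas for a single column:

```python
import pandas as pd

# A column that should be numeric but contains one bad string and one null.
df = pd.DataFrame({"amount": ["10.5", "oops", None, "42"]})

missing = df["amount"].isna().sum()
# Values that are present but fail to parse as the expected type.
parsed = pd.to_numeric(df["amount"], errors="coerce")
mismatched = (parsed.isna() & df["amount"].notna()).sum()
valid = len(df) - missing - mismatched

print(f"valid={valid}, missing={missing}, mismatched={mismatched}")
```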
3. Look for duplicate records and outliers.
Why this sanity test is useful for data validation
Both duplicate records and outliers can distort my analysis, so I need to assess the overall quality of the data set and determine whether it contains any duplicate records or outliers.
How data validation works in Designer Cloud
When you’re working with your data in Designer Cloud, you can switch to the column details view, which displays high-level summary statistics about the contents of each column in your data set.
One of our demo data sets includes a column with the average satisfaction scores for customer service agents. In the column details view, we can quickly see summary statistics about the data, including the number of unique values in the column and any outliers. Our demo data set contains some outlier values. After I’ve identified the outliers, I can decide whether or not to exclude those values from my analysis.
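If you want to reproduce both checks by hand, here is a small pandas sketch with made-up satisfaction scores. For outliers it uses the common 1.5 × IQR convention, which is one reasonable rule of thumb rather than the only definition:

```python
import pandas as pd

# Toy data: one exact duplicate record and one implausible score.
scores = pd.DataFrame({
    "agent": ["ana", "ben", "ben", "cho", "dia", "eve"],
    "avg_score": [4.2, 3.9, 3.9, 4.5, 4.1, 9.8],  # 9.8 is suspicious on a 1-5 scale
})

# Duplicate records: rows identical across all columns.
dupes = scores[scores.duplicated()]

# Outliers via the 1.5 * IQR rule.
q1, q3 = scores["avg_score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = scores[(scores["avg_score"] < q1 - 1.5 * iqr) |
                  (scores["avg_score"] > q3 + 1.5 * iqr)]

print(dupes)
print(outliers)
```

Once flagged, you can drop the duplicates with `scores.drop_duplicates()` and decide case by case what to do with the outliers.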
Real-life sanity checking and data wrangling
One of our customers had been constructing an inventory forecast dashboard that wrangled data from multiple retailers. Making sure the data used in the dashboard was accurate was crucial, especially since any inaccurate or duplicated data in the final dashboard could have a direct impact on our customer’s bottom line.
We performed sanity tests at the beginning of the data-wrangling process and discovered that a number of duplicate records had crept into the source data. We were able to resolve this issue and create a highly accurate inventory forecast for upper management to use when collaborating on promotional campaigns with their retail clients.
4. Assess the data distribution for each column.
Why this sanity test is useful for data validation
Sometimes, the data in a column may look good on the surface: there are no obvious datatype mismatches, duplicate records, outliers, or null values. But if you examine the distribution of data in the column, you may notice gaps or a distribution of values that doesn’t make logical sense. An odd data distribution may indicate a larger data quality issue that you need to investigate: maybe some data points were not recorded, or the data was not encoded correctly.
How data validation works in Designer Cloud
The column details view (discussed earlier) shows you a detailed histogram of value distributions, as well as summary statistics like the standard deviation, median, and average of the values in each column. If you want to quickly validate the data distribution without opening the column details view, you can also look at the small histograms shown above your data.
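A quick way to approximate those histograms yourself is to bucket a column and look at the per-bucket counts; empty buckets flag gaps in the distribution. The data below is synthetic, built deliberately with a hole in the middle of its range:

```python
import pandas as pd

# A column that passes type checks but has a suspicious gap:
# no values at all between 50 and 80.
values = pd.Series(list(range(1, 51)) + list(range(81, 101)))

# Bucket into bins of width 10 and eyeball the counts.
counts = pd.cut(values, bins=list(range(0, 101, 10))).value_counts().sort_index()
print(counts)
```

Three consecutive empty buckets in the middle of an otherwise uniform distribution is exactly the kind of anomaly that deserves a follow-up question to whoever produced the data.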
Real-life sanity checking and data wrangling
Recently, I was working on a project that used raw interaction data to improve the experience on one of our customer’s websites. We were wrangling a set of website visit logs. At first glance, the date/time stamps for each website visit looked valid. However, when we ran a basic sanity check on that data and assessed the distribution of values, we noticed that most of the website visits had occurred between midnight and 3 am. Since it seemed unlikely that a majority of the website visitors took care of their online tasks in the middle of the night, there was clearly a problem with the data!
We investigated the process that was providing us with the weblogs and discovered that the upstream system was truncating any trailing zeroes that appeared in the timestamps. So a timestamp that was really ‘175400’ (17:54:00) was appearing as ‘1754’ (00:17:54). Once we adjusted the data to account for the trailing zero truncation, the data distribution looked much more reasonable.
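A fix along those lines can be sketched in pandas (the sample timestamps are hypothetical, and this only works because we had confirmed the truncation was specifically of trailing zeroes): right-pad each raw value back to six HHMMSS digits, then parse.

```python
import pandas as pd

# Raw timestamps whose trailing zeroes were truncated upstream:
# '175400' (17:54:00) arrived as '1754'.
raw = pd.Series(["1754", "120", "235959"])

# Restore the dropped trailing zeroes by right-padding to six digits
# (HHMMSS), then parse with an explicit format.
fixed = raw.str.ljust(6, "0")
times = pd.to_datetime(fixed, format="%H%M%S").dt.time
print(list(times))  # [17:54:00, 12:00:00, 23:59:59]
```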
The bottom line:
Our support portal has articles, videos, and a forum to help you tackle your data-wrangling challenges, such as how to view mismatched data. To get started with sanity tests for data validation, sign up for a free trial of Alteryx Designer today!