Data Quality 101: Getting Started with Data Quality


What is Data Quality?

Data quality is a measurement of how trustworthy a dataset is for making data-driven decisions. Decisions based on data facts and analysis are only as good as the source data behind them. If an organization makes a decision based on low-quality data, the outcome will likely not have the intended result.

How to Measure Data Quality

Data quality is evaluated on a number of measures. Six standard measures of data quality are accuracy, timeliness, consistency, completeness, validity, and uniqueness. Each measure affects the trustworthiness of the data in a different way.

Consequences of Poor Data Quality

To see how each of these quality attributes affects data-driven decisions in an organization, consider some use case examples from various industries and the consequences of poor data quality.

Accuracy – Is the data correct?

An airline wants to promote a summer fare sale. The marketing department is going to send promotional materials with airline discount codes to customers who have flown in the last 3 years. Customer communication depends on having accurate contact information – in this case, email or mailing addresses. If the data is not accurate, the promotion cannot be delivered to the intended customers and the airline will not be able to maximize capacity on flights.

Timeliness – How recent is the data?

A hospital imaging department is scheduling patients for MRIs. The hospital only has one MRI machine and it’s always in high demand. Physicians are ordering MRIs for their patients and requests are going to the scheduling department. If the scheduling department is working from data that is not timely, it may not know about cancelled scans or patient appointment conflicts, and it cannot optimize use of a scarce resource for the best patient care results.
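As a rough illustration, a timeliness check can compare each record’s last-update timestamp against a maximum acceptable age. The sketch below uses pandas; the appointment data, the last_updated column name, and the 15-minute freshness threshold are all assumptions for illustration, not part of the scenario above.

  import pandas as pd

  # Hypothetical scheduling records with a last-updated timestamp per appointment
  appointments = pd.DataFrame({
      "appointment_id": [101, 102, 103],
      "status": ["scheduled", "cancelled", "scheduled"],
      "last_updated": pd.to_datetime([
          "2024-06-01 08:00", "2024-06-01 08:05", "2024-05-31 22:00"
      ]),
  })

  # Flag records older than an assumed 15-minute freshness threshold
  max_age = pd.Timedelta(minutes=15)
  now = pd.Timestamp("2024-06-01 08:10")  # in practice, pd.Timestamp.now()
  appointments["is_stale"] = (now - appointments["last_updated"]) > max_age

  # Stale rows are candidates for a refresh before scheduling decisions are made
  print(appointments[appointments["is_stale"]])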

Consistency – Is the data consistent across related datasets?

A packaged goods distributor is optimizing delivery routes. The data shows that a warehouse is in Portland. It is important that warehouse codes are consistent across datasets, so that if one dataset shows the warehouse located in Portland, Oregon, a related dataset doesn’t show that same warehouse located in Portland, Maine. If the data is not consistent, the delivery routes will be inaccurate and one of the warehouses will miss its delivery supply.
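One way to catch this kind of disagreement is to join the two datasets on the shared warehouse code and flag codes whose recorded locations differ. The sketch below is a minimal pandas example; the warehouse codes, column names, and locations are hypothetical.

  import pandas as pd

  # Two hypothetical datasets that should agree on where each warehouse is located
  routes = pd.DataFrame({
      "warehouse_code": ["WH-01", "WH-02"],
      "location": ["Portland, OR", "Boise, ID"],
  })
  inventory = pd.DataFrame({
      "warehouse_code": ["WH-01", "WH-02"],
      "location": ["Portland, ME", "Boise, ID"],
  })

  # Join on the shared code and keep rows where the two locations disagree
  merged = routes.merge(inventory, on="warehouse_code",
                        suffixes=("_routes", "_inventory"))
  inconsistent = merged[merged["location_routes"] != merged["location_inventory"]]
  print(inconsistent)  # WH-01 appears in Portland, OR in one dataset and Portland, ME in the other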

Completeness – Does the dataset have any null values?

A telco company is analyzing dropped calls to predict customer satisfaction and expected churn rates. A significant number of cell towers in the southeast lost connection during a recent hurricane, causing dropped calls during a critical event, but the data from those towers is missing from the dataset and the data fields are empty. The customer satisfaction analysis is based on incomplete data: the telco company is missing a portion of essential data in the analysis and will miss the opportunity to start proactive customer care for customer retention.
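A simple completeness check counts how many values are missing in each column before the analysis runs. The sketch below uses pandas with hypothetical tower records and column names.

  import pandas as pd
  import numpy as np

  # Hypothetical dropped-call records; towers knocked out by the storm report no values
  calls = pd.DataFrame({
      "tower_id": ["SE-101", "SE-102", "SE-103", "NW-201"],
      "dropped_calls": [42, np.nan, np.nan, 7],
  })

  # Completeness per column: the share of non-null values
  completeness = calls.notna().mean()
  print(completeness)                          # dropped_calls is only 0.5 complete here
  print(calls[calls["dropped_calls"].isna()])  # the towers with missing data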

Validity – Is the data in the correct format?

An insurance provider is analyzing claim rates and wants to know which regions in the United States have higher instances of certain claims. The analysts are using event history with addresses and zip codes to predict future claims, which will be used to set rates for the next 5 years, but they are working with low-quality data. The zip code field is expected to contain standard 5-digit US zip codes. Many of the entries have 5-digit zip codes and some have ZIP+4 codes (5 digits plus 4), but one of the regional offices incorrectly entered area codes instead of zip codes for all of its hail claims after a major storm. If the analysts use this dataset as is to determine rates for future years, they will incorrectly assess the hail risk for a region, which will affect distribution and rates for all customers.
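A validity rule like this can be expressed as a format check. The sketch below uses pandas and a regular expression that accepts 5-digit zip codes and ZIP+4 codes; the claims data and column names are hypothetical.

  import pandas as pd

  # Hypothetical claims data; one office entered 3-digit area codes instead of zip codes
  claims = pd.DataFrame({
      "claim_id": [1, 2, 3, 4],
      "zip_code": ["97201", "97201-1234", "503", "303"],
  })

  # Valid formats: exactly 5 digits, optionally followed by a hyphen and 4 digits (ZIP+4)
  pattern = r"\d{5}(-\d{4})?"
  claims["zip_is_valid"] = claims["zip_code"].str.fullmatch(pattern)
  print(claims[~claims["zip_is_valid"]])  # rows that fail the format rule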

Uniqueness – Does each record represent a unique identifier?

In some datasets, it is critical that line items are unique. When a financial services institution assigns account numbers, each account number must be unique so that it identifies a single account. If multiple, unrelated accounts are all assigned the same account number, it will be difficult to determine who owns each account.
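A uniqueness check looks for identifiers that appear more than once. The sketch below uses pandas; the account numbers and owner names are made up for illustration.

  import pandas as pd

  # Hypothetical account records; the same account number appears twice
  accounts = pd.DataFrame({
      "account_number": ["100045", "100046", "100045"],
      "owner": ["A. Rivera", "B. Chen", "C. Okafor"],
  })

  # Any account number assigned to more than one owner violates uniqueness
  duplicates = accounts[accounts.duplicated("account_number", keep=False)]
  print(duplicates)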

How to Improve Data Quality

Assessing and managing data quality is a challenge for most organizations. How does a data team manage data that is constantly changing and needs to be kept up to date? How do data owners make sure that their organization is using the highest quality data?

A combination of people, processes, and technology will deliver the best results for data quality. Here are four suggestions to implement a data quality program:

  • Make data quality a priority for the organization because decisions based on bad data will lead to the wrong conclusions.
  • Enable the data owners and business owners to set data quality goals and rules because the people who use the data most will know what is important for analysis.
  • Make data quality rules easy to understand and describe them in plain language.
  • Make data quality measurements clear and available so that data workers can select the highest quality data.

How to Get Started Managing Data Quality

Data quality is essential for understanding the health of data in your environment so you can manage it, resolve issues, and use the best data for business decisions. Data quality is measured across several dimensions that data owners can monitor based on the attributes that matter for a specific dataset. BigID helps to scale and automate data quality management and measurement, turning a once difficult and complicated problem into a manageable process that delivers the best data quality results.

Is managing data quality a challenge at your organization? See how BigID adds automation and insight to improve Data Quality for better business outcomes.