Data Quality 101: What It Is, Why It’s Important

What Is Data Quality?

Organizations make decisions based on data — and those decisions are only as good as the data they’re based on. If a business makes a decision based on low-quality data, the outcome is not likely to meet expectations.

Data quality measures how reliable a dataset is for making a data-driven decision — or, in a word, the data’s trustworthiness.

Why Is Data Quality Important?

According to Gartner, poor data quality costs organizations an average of $12.9 million annually and brings related consequences such as damaged customer relationships, ill-informed business decisions, and muddled data ecosystems.

Fortunately, data quality is having its day, with more and more enterprises focusing on how it can drive better business decisions. This year, Gartner predicts that metrics-driven data quality tracking will increase by a whopping 60 percent.

Organizations can capitalize on the competitive advantage that improving data quality will give them — and now is the time.

Data Quality Dimensions

There are six standard dimensions of data quality. Each one impacts business decisions in a different way, but must also be considered in relation to the others. The six dimensions are:

  1. Accuracy
  2. Timeliness
  3. Consistency
  4. Completeness
  5. Validity
  6. Uniqueness

Here are some use cases across various industries that demonstrate how each of these attributes might affect an organization’s data decisions.

Accuracy — Is the data correct?

An airline wants to promote a summer sale. The marketing department is going to send promotional materials with airline discount codes to customers who have flown in the last three years.

Customer communication depends on having accurate contact information — in this case, email or mailing addresses. If the data is not accurate, the promotion cannot be delivered to the intended customers — and the airline will not meet its goals for the promotion.

Timeliness — How recent is the data?

A hospital imaging department is scheduling patients for MRIs. The hospital only has one MRI machine, and it’s always in high demand.

When physicians order MRIs for their patients, requests go to the scheduling department. The scheduling department must work from data that is as up-to-date as possible to know about canceled plans or appointment conflicts. Otherwise, they won’t be able to optimize the use of a scarce resource for the best patient care.
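As a rough illustration of a timeliness check, the sketch below flags scheduling records that have not been refreshed recently. The 15-minute threshold, the function name, and the timestamps are hypothetical, chosen only for this example; a real scheduling system would set its own freshness window.

```python
from datetime import datetime, timedelta

def is_stale(last_updated, now, max_age=timedelta(minutes=15)):
    """Return True if the record is older than the allowed freshness window.

    max_age is an assumed threshold for illustration only.
    """
    return now - last_updated > max_age

# Fixed timestamps keep the example deterministic.
now = datetime(2024, 6, 1, 9, 0)
fresh_record = datetime(2024, 6, 1, 8, 50)   # updated 10 minutes ago
stale_record = datetime(2024, 6, 1, 7, 30)   # updated 90 minutes ago

fresh_is_stale = is_stale(fresh_record, now)
stale_is_stale = is_stale(stale_record, now)
```

A scheduler could skip or re-fetch any record for which the check returns True before booking a slot.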

Consistency — Is the data the same across related datasets?

A packaged goods distributor is optimizing delivery routes. The data shows that a warehouse is in “Portland.”

The warehouse codes must be consistent across datasets, so that if one set of data shows that the warehouse is in Portland, Oregon, another related dataset doesn’t suggest that the same warehouse is in Portland, Maine.

If the location data is not consistent, the delivery routes will be inaccurate and one of the warehouses will miss its deliveries.
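One way to picture a consistency check is to compare the location recorded for each warehouse code in two related datasets. This is a minimal sketch: the warehouse codes, dataset names, and dictionary layout are assumptions made for the example, not a description of any particular system.

```python
def find_location_conflicts(dataset_a, dataset_b):
    """Return warehouse codes whose (city, state) differs between two datasets."""
    conflicts = {}
    for code, location_a in dataset_a.items():
        location_b = dataset_b.get(code)
        # Only codes present in both datasets can conflict.
        if location_b is not None and location_b != location_a:
            conflicts[code] = (location_a, location_b)
    return conflicts

# Hypothetical sample data: the same code points at two different Portlands.
routing = {"WH-01": ("Portland", "OR"), "WH-02": ("Austin", "TX")}
inventory = {"WH-01": ("Portland", "ME"), "WH-02": ("Austin", "TX")}

conflicts = find_location_conflicts(routing, inventory)
```

Here `conflicts` holds only `WH-01`, the code that maps to Oregon in one dataset and Maine in the other.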

Completeness — Does the dataset have any null values?

A telecommunications company is analyzing dropped calls to predict customer satisfaction and expected churn rates. A significant number of cell towers in the southeast lost connection during a recent hurricane.

While the natural disaster caused a number of dropped calls, the data from those towers is missing from the dataset — and those fields are empty. The customer satisfaction analysis is based on incomplete data.

Since the telecommunications company is missing part of its essential data, the resulting analysis will be incorrect, thwarting or delaying its proactive customer care and retention efforts.
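A simple completeness measure is the share of records that have a value in each field. The sketch below assumes records are plain dictionaries and treats `None` and the empty string as missing; the field names and tower data are made up for illustration.

```python
def completeness(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    scores = {}
    for field in fields:
        # None and "" count as missing; a legitimate 0 counts as present.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        scores[field] = filled / total if total else 0.0
    return scores

# Hypothetical tower data: two towers reported no dropped-call counts.
tower_records = [
    {"tower_id": "T1", "dropped_calls": 12},
    {"tower_id": "T2", "dropped_calls": None},
    {"tower_id": "T3", "dropped_calls": 7},
    {"tower_id": "T4", "dropped_calls": None},
]

scores = completeness(tower_records, ["tower_id", "dropped_calls"])
```

In this sample, `tower_id` is fully populated while only half of the `dropped_calls` values are present, which signals that any analysis built on that field covers only part of the network.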

Validity — Is the data in the correct format?

An insurance provider is analyzing claim rates and wants to know which regions in the United States have higher instances of certain claims. The analysts are using event history with addresses and zip codes to predict future claims, which will help them set rates for the next five years. However, they are using low-quality data.

The zip code field is expected to contain standard five-digit U.S. zip codes. Many of the entries do; some have five-digit zip codes plus a four-digit extension; and one of the regional offices mistakenly entered area codes in place of zip codes after hail claims from a major storm came in.

If the analysts use this dataset — as is — to determine the next five years’ rates, they will incorrectly assess the hail risk for a region, and that assessment will affect all of their rates for all customers.
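A format check like the one this scenario calls for can be sketched with regular expressions. The example assumes the extended form is written as five digits, a hyphen, and four digits, and the sample entries are invented. Note that this validates format only: it cannot tell whether a well-formed zip code actually exists or belongs to the claim's address.

```python
import re

ZIP5 = re.compile(r"\d{5}")            # standard five-digit zip code
ZIP_PLUS_4 = re.compile(r"\d{5}-\d{4}")  # assumed "12345-6789" extended form

def classify_zip(value):
    """Label a zip code entry as 'zip5', 'zip+4', or 'invalid'."""
    value = value.strip()
    if ZIP5.fullmatch(value):
        return "zip5"
    if ZIP_PLUS_4.fullmatch(value):
        return "zip+4"
    return "invalid"

# Hypothetical entries, including a three-digit area code typed by mistake.
entries = ["97201", "04101-1234", "503", "9720"]
results = [classify_zip(entry) for entry in entries]
```

Entries labeled `zip+4` could be truncated to their first five digits to match the expected format, while `invalid` entries would need to be corrected at the source before the rate analysis runs.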

Uniqueness — Does each line represent an individual identifier?

In some datasets, each record must be unique. When a financial services institution assigns account numbers, it is critical that each account number identifies exactly one account. If multiple, unrelated accounts are assigned the same account number, it will be difficult to determine who owns a given account.
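A uniqueness check can be as simple as counting how often each identifier appears. The following sketch assumes each account is a dictionary with an `account_number` key; the sample accounts and owner names are hypothetical.

```python
from collections import Counter

def duplicate_account_numbers(accounts):
    """Return the account numbers that appear on more than one record."""
    counts = Counter(account["account_number"] for account in accounts)
    return sorted(number for number, count in counts.items() if count > 1)

# Hypothetical records: two unrelated owners share account number 1002.
accounts = [
    {"account_number": "1001", "owner": "A. Rivera"},
    {"account_number": "1002", "owner": "B. Chen"},
    {"account_number": "1002", "owner": "C. Okafor"},
    {"account_number": "1003", "owner": "D. Singh"},
]

duplicates = duplicate_account_numbers(accounts)
```

Any number returned by the check identifies records that need to be reconciled before the data can be trusted.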

Data Quality Best Practices

Assessing and managing data quality is a challenge for most organizations. How does a data team manage data that is constantly changing and needs to be kept up to date? How do data owners make sure that their organization is using the highest-quality data?

Organizations need the right people, processes, and technology to deliver the best quality for their data. To implement an effective program:

  • Make data quality a priority for the organization.
  • Understand how inaccurate, outdated, inconsistent, incomplete, invalid, and redundant data will lead to incorrect analysis, misguided business decisions, and lost revenue.
  • Enable data owners and business owners to set data quality goals and rules — the professionals who use the data most will know what is most important for analysis.
  • Make the rules easy to understand, and use plain language to describe them.
  • Make data quality measurements clear and available so that data workers can select the highest-quality data.
  • Define, establish, and implement standards across the enterprise.

How to Get Started with BigID

Data quality is measured along several dimensions that data owners can track and monitor for each dataset. This tracking is essential for organizations to:

  • understand the health of their data
  • manage data
  • resolve data issues
  • use the best data for business decisions

BigID helps scale and automate data quality measurement and management, turning a labor-intensive, complex problem into a manageable, ML-based solution. With BigID, organizations can:

  • Actively monitor data anomalies to improve efficiency
  • Dynamically profile changing data to derive relevant data quality scores
  • Add custom metrics to datasets
  • Apply data quality scores across data sources
  • Get 360° insights for all data (structured, unstructured, semi-structured, on-prem, in the cloud, and hybrid) for the broadest coverage available in the market — all in a unified inventory
  • Take action to improve the accuracy, timeliness, consistency, completeness, validity, and uniqueness of their data
  • Take a proactive approach that creates a competitive advantage and leads to well-informed business decisions
  • Use their data with confidence

Is managing data quality a challenge at your organization? See how BigID adds automation and insight to lead to better business outcomes.