Data Quality 101: What It Is, Why It’s Important
Data quality has become critical to any business. Data is a valuable asset that organizations need to leverage and protect, and organizations across all industries are struggling with increasingly complex data management challenges, including how to improve data quality and manage risk.
What Is Data Quality?
Data quality is the extent to which data is complete, consistent, and accurate. It is a measure of how well data meets the requirements of its intended use.
Organizations make decisions based on data — and those decisions are only as good as the data they’re based on. If a business makes a decision based on low-quality data, the outcome is not likely to meet expectations.
Data quality measures how reliable a dataset is for making a data-driven decision — or, in a word, the data’s trustworthiness.
Why Is Data Quality Important?
According to Gartner, poor data quality costs organizations an average of $12.9 million annually and brings a heap of related negative consequences, like damaged customer relationships, ill-informed business decisions, and muddled data ecosystems.
Fortunately, data quality is having its day, with more and more enterprises focusing on how it can drive better business decisions. This year, Gartner predicts that metrics-driven data quality tracking will increase by a whopping 60 percent.
Organizations can capitalize on the competitive advantage that improving data quality will give them — and now is the time.
Data Quality Dimensions
Data quality dimensions are the characteristics of data that determine its quality. They can be grouped into six categories: accuracy, timeliness, consistency, completeness, validity, and uniqueness.
The definitions of these six aspects may vary slightly depending on who you ask or the context in which they're applied, but here's how we define them, along with use cases from various industries that show how each attribute might affect an organization's data decisions:
Accuracy — Is the data correct?
Accurate means that the information gathered is correct and has not been tainted by human error or machine malfunction during collection, processing, storage, analysis, or transmission.
An airline wants to promote a summer sale. The marketing department is going to send promotional materials with airline discount codes to customers who have flown in the last three years.
Customer communication depends on having accurate contact information — in this case, email or mailing addresses. If the data is not accurate, the promotion cannot be delivered to the intended customers — and the airline will not meet its goals for the promotion.
Timeliness — How recent is the data?
Timely means the data reflects the most current state of what it describes and is available early enough that changes can be made as necessary.
A hospital imaging department is scheduling patients for MRIs. The hospital only has one MRI machine, and it’s always in high demand.
When physicians order MRIs for their patients, requests go to the scheduling department. The scheduling department must work from data that is as up-to-date as possible to know about canceled plans or appointment conflicts. Otherwise, they won’t be able to optimize the use of a scarce resource for the best patient care.
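A timeliness check like the scheduling department needs can be sketched as a simple freshness rule: flag any record that hasn't been refreshed within an acceptable window. The slot names, timestamps, and one-hour threshold below are hypothetical illustrations, not part of any real scheduling system:

```python
from datetime import datetime, timedelta

def is_stale(last_updated, now, max_age=timedelta(hours=1)):
    """Return True if the record has not been refreshed within max_age."""
    return now - last_updated > max_age

# Hypothetical MRI schedule data: the second slot was last synced yesterday,
# so a cancellation there could be missed.
now = datetime(2024, 6, 1, 9, 0)
schedule = [
    {"slot": "09:30", "last_updated": datetime(2024, 6, 1, 8, 45)},
    {"slot": "10:00", "last_updated": datetime(2024, 5, 31, 17, 0)},
]

stale_slots = [s["slot"] for s in schedule if is_stale(s["last_updated"], now)]
```

The right threshold depends entirely on the use case: an hour may be fine for route planning, while a trading system might measure staleness in milliseconds.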
Consistency — Is the data the same across related datasets?
Consistent means that the same entity is represented the same way from one record to the next and across related datasets, so that similar records never contradict one another.
A packaged goods distributor is optimizing delivery routes. The data shows that a warehouse is in “Portland.”
The warehouse codes must be consistent across datasets, so that if one set of data shows that the warehouse is in Portland, Oregon, another related dataset doesn’t suggest that the same warehouse is in Portland, Maine.
If the location data is not consistent, the delivery routes will be inaccurate and one of the warehouses will miss its supply delivery.
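One way to surface this kind of conflict is to join two datasets on a shared key and report any records whose values disagree. This is only a sketch; the warehouse IDs, field names, and datasets are hypothetical:

```python
def inconsistent_keys(dataset_a, dataset_b, key, field):
    """Return the keys whose `field` value differs between the two datasets."""
    b_index = {r[key]: r[field] for r in dataset_b}
    return [
        r[key]
        for r in dataset_a
        if r[key] in b_index and r[field] != b_index[r[key]]
    ]

# Hypothetical example: routing data says Oregon, inventory data says Maine.
routing = [{"warehouse_id": "WH-7", "state": "OR"}]
inventory = [{"warehouse_id": "WH-7", "state": "ME"}]

conflicts = inconsistent_keys(routing, inventory, "warehouse_id", "state")
```

A check like this only detects the disagreement; deciding which dataset is the system of record is a governance decision, not a code decision.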
Completeness — Does the dataset have any null values?
Complete means having all required components for a given task or purpose.
A telco company is analyzing dropped calls to predict customer satisfaction and expected churn rates. A significant number of cell towers in the southeast lost connection during a recent hurricane.
While the natural disaster caused a number of dropped calls, the data from those towers is missing from the dataset — and those fields are empty. The customer satisfaction analysis is based on incomplete data.
Since the telco company is missing part of its essential data, the resulting analysis will be incorrect, thwarting or delaying its efforts toward proactive customer care for customer retention.
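Completeness is one of the easier dimensions to quantify: count how many records actually have a value for a required field. The tower IDs and field names below are hypothetical, loosely modeled on the scenario above:

```python
def completeness(records, field):
    """Return the fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

# Hypothetical call data: two towers went offline during the hurricane,
# leaving their dropped-call counts empty.
calls = [
    {"tower": "SE-101", "dropped_calls": 12},
    {"tower": "SE-102", "dropped_calls": None},
    {"tower": "SE-103", "dropped_calls": 7},
    {"tower": "SE-104", "dropped_calls": None},
]

score = completeness(calls, "dropped_calls")  # 0.5: half the records are empty
```

A low score doesn't say how to fill the gap, but it tells analysts up front that conclusions drawn from this dataset rest on partial evidence.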
Validity — Is the data in the correct format?
Data validity refers to the consistency of data values according to established rules and standards.
An insurance provider is analyzing claim rates and wants to know which regions of the United States have higher instances of certain claims. The analysts are using event history with addresses and zip codes to predict future claims and set rates for the next five years, but they are working from low-quality data.
The zip code field is expected to contain standard five-digit U.S. zip codes. Many of the entries do; some contain nine-digit ZIP+4 codes; and one of the regional offices incorrectly entered area codes instead of zip codes after hail claims from a major storm came in.
If the analysts use this dataset — as is — to determine the next five years’ rates, they will incorrectly assess the hail risk for a region, and that assessment will affect all of their rates for all customers.
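A validity rule for this scenario can be sketched as a format check: accept five-digit zip codes and ZIP+4 values, and flag everything else (such as a three-digit area code) for correction. The claim records and field names here are hypothetical:

```python
import re

# Accept "97201" or "97201-1234"; reject anything else.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def is_valid_zip(value):
    """Return True if the value is a standard U.S. zip or ZIP+4 code."""
    return bool(ZIP_PATTERN.match(value))

# Hypothetical claims: the third entry is an area code, not a zip code.
claims = [
    {"claim_id": 1, "zip": "97201"},
    {"claim_id": 2, "zip": "97201-1234"},
    {"claim_id": 3, "zip": "503"},
]

invalid = [c for c in claims if not is_valid_zip(c["zip"])]
```

Flagging invalid entries before the analysis runs lets the regional office correct them, instead of letting a formatting mistake skew five years of rates.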
Uniqueness — Does each record have a distinct identifier?
In some datasets, each line item must be completely unique. When a financial services institution assigns account numbers, it is critical that each account number identifies exactly one account. If multiple unrelated accounts are assigned the same account number, it becomes difficult to determine who owns which account.
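A uniqueness check boils down to counting how often each identifier appears and reporting any that appear more than once. The account numbers and owners below are hypothetical:

```python
from collections import Counter

def duplicate_values(records, key):
    """Return the values of `key` that appear in more than one record."""
    counts = Counter(r[key] for r in records)
    return [value for value, n in counts.items() if n > 1]

# Hypothetical accounts: "1001" has been assigned twice.
accounts = [
    {"account_number": "1001", "owner": "A. Rivera"},
    {"account_number": "1002", "owner": "B. Chen"},
    {"account_number": "1001", "owner": "C. Okafor"},
]

dupes = duplicate_values(accounts, "account_number")
```

In practice a check like this would run before new records are committed, so a collision is rejected at the source rather than discovered during an audit.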
Data Quality Best Practices
Data quality management is a major concern for organizations in all industries. It can have a significant impact on your business, and it's important to know how to improve it. Applying data quality best practices helps ensure that your data is accurate, timely, consistent, complete, valid, and unique, supporting your organization's future goals.
It’s important to know what data quality is and what it isn’t. Data quality is not one thing; it’s a combination of the six dimensions listed above and how they interact with each other. For example, if you have accurate and consistent data but your records are incomplete (meaning you don’t have all the information), your overall data can still be considered poor quality.
Organizations need the right people, processes, and technology to deliver the best quality for their data. To implement an effective program:
- Make data quality a priority for the organization.
- Understand how inaccurate, outdated, inconsistent, incomplete, invalid, and redundant data will lead to incorrect analysis, misguided business decisions, and lost revenue.
- Enable data owners and business owners to set data quality goals and rules — the professionals who use the data most will know what is most important for analysis.
- Make the rules easy to understand, and use plain language to describe them.
- Make data quality measurements clear and available for data workers to select the highest quality data.
- Define, establish, and implement standards across the enterprise.
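To illustrate the last few practices, quality rules can be written as plain-language labels attached to simple checks, producing per-rule scores that data workers can read at a glance. The customer records, field names, and rules below are hypothetical, not a prescribed rule set:

```python
def score_dataset(records, rules):
    """Evaluate named rules (plain-language label -> check) over records.

    Returns the fraction of records passing each rule, so anyone can see
    which quality dimension needs attention.
    """
    return {
        name: sum(1 for r in records if check(r)) / len(records)
        for name, check in rules.items()
    }

# Hypothetical customer data with one missing email and one bad zip code.
customers = [
    {"email": "a@example.com", "zip": "97201"},
    {"email": "", "zip": "97201"},
    {"email": "b@example.com", "zip": "503"},
]

rules = {
    "email is present": lambda r: r["email"] != "",
    "zip has five digits": lambda r: r["zip"].isdigit() and len(r["zip"]) == 5,
}

scores = score_dataset(customers, rules)
```

Keeping the rule names in plain language, as the best practices above recommend, means the resulting scores are readable by business owners, not just engineers.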
Data Quality Management Tools
If you want to improve the quality of your data, or measure and monitor it, there are many tools that can help identify data quality issues and gaps. Some examples include:
Data Quality Assessment Tools – these tools allow users to create reports that evaluate their datasets against specific rules or guidelines (e.g., industry standards). They may also provide feedback on what can be improved within each dataset so that it meets the appropriate standards.
Data Profiling Tools – these tools examine the contents of a dataset and summarize its structure and statistics, such as value distributions, formats, null counts, and outliers. Profiling helps organizations understand what their data actually contains and where quality rules and fixes are most needed.
Improve Data Quality with BigID
Data quality is measured along these dimensions, which data owners can track and monitor for each specific dataset. This tracking is essential for organizations to:
- understand the health of their data
- manage data
- resolve data issues
- use the best data for business decisions
BigID helps scale and automate data quality measurement and management, turning a labor-intensive, complex problem into a manageable, ML-based solution. With BigID, organizations can:
- Actively monitor data anomalies to improve efficiency
- Dynamically profile changing data to derive relevant data quality scores
- Add custom metrics to datasets
- Apply data quality scores across data sources
- Get 360° insights for all data (structured, unstructured, semi-structured, on-prem, in the cloud, and hybrid) for the broadest coverage available in the market — all in a unified inventory
- Take action to improve the accuracy, timeliness, consistency, completeness, validity, and uniqueness of their data
- Take a proactive approach that creates a competitive advantage and leads to well-informed business decisions
- Use their data with confidence
Is managing data quality a challenge at your organization? See how BigID adds automation and insight to lead to better business outcomes.