To be successful in business, you need to make decisions fast and based on the right information. One of the important functions of a data warehouse and business intelligence solution is to provide users with a snap-shot of their business at any given point of time. This allows decision makers to gain better insight into their business and market so that they can make decisions faster and better.
While a business intelligence system makes it much simpler to analyze and report on the data loaded into a data warehouse system, the existence of data alone does not ensure that executives make decisions smoothly; the quality of the data is equally as important.The existence of data alone does not ensure that decisions are made smoothly; the quality of data is just as important. Click To Tweet
Consider a high-level meeting to review company performance: if you learn that two reports compiled from supposedly the same set of data reflect two different revenue figures, no one can know which figures are accurate, which could cause important decisions to be postponed while the “truth” is investigated.
One of the causes of data quality issues is in source data that is housed in a patchwork of operational systems and enterprise applications. Each of these data sources can have scattered or misplaced values, outdated and duplicate records, and inconsistent (or undefined) data standards and formats across customers, products, transactions, financials and more.
Data quality problems can also arise when an enterprise consolidates data during a merger or acquisition. But perhaps the largest contributor to data quality issues is that the data are being entered, edited, maintained, manipulated and reported on by people.
To maintain the accuracy and value of the business-critical operational information that impact strategic decision-making, businesses should implement a data quality strategy that embeds data quality techniques into their business processes and into their enterprise applications and data integration.
On the surface, it is obvious that data quality is about cleaning up bad data – data that are missing, incorrect or invalid in some way. But in order to ensure data are trustworthy, it is important to understand the key dimensions of data quality to assess how the data are “bad” in the first place.
Completeness is defined as expected comprehensiveness. Data can be complete even if optional data is missing. As long as the data meets the expectations then the data is considered complete.
For example, a customer’s first name and last name are mandatory but middle name is optional; so a record can be considered complete even if a middle name is not available.
Questions you can ask yourself: Is all the requisite information available? Do any data values have missing elements? Or are they in an unusable state?
Consistency means data across all systems reflects the same information and are in synch with each other across the enterprise. Examples:
- A business unit status is closed but there are sales for that business unit.
- Employee status is terminated but pay status is active.
Questions you can ask yourself: Are data values the same across the data sets? Are there any distinct occurrences of the same data instances that provide conflicting information?
Conformity means the data is following the set of standard data definitions like data type, size and format. For example, date of birth of customer is in the format “mm/dd/yyyy”
Questions you can ask yourself: Do data values comply with the specified formats? If so, do all the data values comply with those formats?
Maintaining conformance to specific formats is important.
Accuracy is the degree to which data correctly reflects the real world object OR an event being described. Examples:
- Sales of the business unit are the real value.
- Address of an employee in the employee database is the real address.
Questions you can ask yourself: Do data objects accurately represent the “real world” values they are expected to model? Are there incorrect spellings of product or person names, addresses, and even untimely or not current data?
These issues can impact operational and analytical applications.
Integrity means validity of data across the relationships and ensures that all data in a database can be traced and connected to other data.
For example, in a customer database, there should be a valid customer, addresses and relationship between them. If there is an address relationship data without a customer then that data is not valid and is considered an orphaned record.
Ask yourself: Is there are any data missing important relationship linkages?
The inability to link related records together may actually introduce duplication across your systems.
Timeliness references whether information is available when it is expected and needed. Timeliness of data is very important. This is reflected in:
- Companies that are required to publish their quarterly results within a given frame of time
- Customer service providing up-to date information to the customers
- Credit system checking in real-time on the credit card account activity
The timeliness depends on user expectation. Online availability of data could be required for room allocation system in hospitality, but nightly data could be perfectly acceptable for a billing system.
As you can see, data quality is an important issue that should be considered – starting with initial application design, all the way through implementation, maintenance and use. In future blog posts, we will discuss techniques for addressing these types of data quality issues.