Posted by Karen
April 7, 2022
The explosive growth of data has provided incredible business opportunities but has also presented challenges for many companies, and one of the biggest challenges is ensuring data accuracy. Although we have seen some companies with robust data validation practices, more often companies are in dire need of a systematic and disciplined approach. It is not unusual for business users to tell us they receive inaccurate or questionable data from their data team! The use of “bad” data can significantly impact the performance of the business and in some cases, prove catastrophic.
Data validation, a form of data cleansing, is the process of verifying the accuracy and quality of data before using it.
As a business user, you may not realize that raw data rarely meets an organization’s analytical needs, so it must be manipulated into a form suitable for further analysis. Any data process requiring large volumes of data to be transformed, merged, and cleansed is intrinsically error prone – the more data, the greater the likelihood of error.
For example, when moving and integrating data from different sources and repositories, it may not conform to business rules and may become corrupted due to inconsistencies in type or context.
“Bad” data includes data that is inaccurate, incomplete, inconsistent, duplicated, poorly compiled, or not relevant for its intended use.
“Bad” data can be prevented by following best practices for data validation. The goal is to create data that is consistent, accurate, and complete to ensure the data presented to business users in reports, dashboards, or other tools is correct. When data is consistently accurate, business users trust reports and can confidently make critical decisions based on the right data. If data can’t be trusted, insights from reports or data visualizations can’t be trusted, and ultimately, companies won’t be able to make data driven decisions.
As a first step, it is critical that the data from each source follows data quality measures to ensure that the validation process begins with “high quality” data.
“High quality” data indicates data meets the needs of the organization in terms of operating, and decision and planning support.
Data Quality measures should be verified for each data source:
Accuracy: Are the data records error-free, and can they be used as a reliable source of information?
Data Completeness: Is data complete for all relevant information?
Consistency: Does information in one table match the same information in another?
Timeliness: Is data readily available when the business needs it?
Validity: Does data follow business rules? Business rules are a set of actions or constraints that are applied to data to comply with data quality standards as well as make the data usable and meaningful to non-technical data consumers.
Uniqueness – Do tables consist of unique sets of data or is data repeated among tables?
It is strongly recommended that a proactive approach be taken to identify potential data inconsistencies early to avoid the complexity, cost, and time of having to fix them during later stages of the project.
Data integration is a process that combines data from multiple sources into a single unified data repository.
Once data quality measures have been verified for all data sources, there are many different transformations, integrations, and aggregations required for large volumes of data within an ETL (Extract, Transform, Load) process.
ETL (Extract, Transform, Load) is a data integration process that collects data from original sources (Extract), cleans and combines it into a format that can be analyzed (Transform), and centralizes it into a target repository (Load).
If any single one of the ETL processes is not developed correctly, the resulting metrics will be inaccurate. This in turn may result in unjustified decision making at the business level. Analytics are only as good as the data that supports it, so it is crucial to implement best practices early when developing ETL workflows.
Optimus SBR Data uses our ADM – Analytical Data Mart to blend multiple, disparate data sources for further analytics and data visualization. Data validation is built into the ADM framework, and each of the ADM’s three tiers – landing, integration, analytics – has a specific purpose and set of validation techniques.
Since a report or dashboard is only as useful as the data that powers it, creating test cases to support data accuracy is crucial in the validation process. Each tier has its own set of test cases that support that tier’s purpose. ETL testing ensures the transfer of data from different sources to a target strictly adheres to transformation rules and remains compliant with all validity checks.
Landing Tier maintains an exact copy of data from the source tables to ensure that a reference of the source is always available.
Test cases: Since the landing layer functions as an exact copy of the original data, the record values and data types must match. The test case for the Landing Tier involves performing a count of records and a list of the metadata of the original and the copy. If these tests are successful, a more detailed test can be performed comparing the values of the original and copy using an ETL tool such as KNIME or Alteryx.
Integration Tier combines the raw data from the landing layer by applying transformations and data structure best practices (e.g., consolidations, aggregations, removal of duplicates).
Test cases: There are several test cases involving the structure of the data that must be created at the Integration Tier to test relationships between fields, tables, and structure.
Data redundancy: Normalization is applied to reduce data redundancy. This divides large redundant tables into smaller tables with a specific purpose and links them using relationships.
Data Integrity: Validation occurs before inserting, updating, or deleting the data. Tests can be performed to determine if the metrics contain any incorrect data (e.g., if sales contain any negative numbers or name fields contain special characters). Another test would be to insert inconsistent data to ensure it fails (e.g., insert a product number with 8 characters when the format is 7).
Parent/child: Data behaviour also requires testing. The database should indicate an error when a child record is inserted before a parent record. For example, when a sale is added that contains a product number that does not exist in the product table.
Cardinality: Involving the relationship of data in one table joined with another table, cardinality refers to whether a relationship is one-to-one, many-to-one, or many-to-many. For example, when testing the relationship between a sales fact table and a product dimension table, many of the products will be the same, but a product dimension table will have unique values for products, so joining these tables will create a many to one relationship. To test the cardinality, a distinct list of the primary keys (fields designated to identify unique records) is pulled in both tables. The table with the cardinality ‘one’ will have unique values while the table with the cardinality ‘many’ will not. If the product table has duplicate Product IDs, we know that it will involve some cleansing.
Analytical Tier transforms data from the integration layer to create tables and marts that apply specifically to the business. The analytical layer is pulled by a data analysts or data scientist to be used by the business. The function of this layer is to answer real business questions, so it is important to test scenarios that the business will regularly ask.
Test cases: For a retail store, for example, an analyst would create a table with metrics such as sales, margin, and cost that are sliced by dimensions such as region, brand, and product. The resulting information would answer questions like:
What are the sales by region for this year?
What are the 5 least profitable stores this month?
Which stores improved their sales the most compared to last year?
Tests cases are performed that align with how the business slices their data. The answers can then be cross referenced with data from each tier to verify data accuracy.
When large volumes of data are being validated for analytics, manually sifting through millions of records can not only be error prone but very time-consuming. A great way to increase the efficiency of data validation is to automate tasks using SQL functions. Specific test cases are queried by looking at metrics and comparing the source value and destination value. Queries are named by using a function, so that function can be run when you make changes to the ETL process. Here is a sample output:
A significant delta will indicate which tables and fields are producing problems in the workflow, so they can be rectified. This greatly reduces the amount of manual work, reduces errors, and speeds up the process of data validation.
Data Validation has become increasingly important and complex with the massive data projects we are seeing. Businesses need to have absolute trust in their data, and decisions must be based on accurate data. The solution is to implement and adhere to a rigorous data validation process that follows the…
Explore Posts By Category