Organizations have been increasing their focus on data analytics and visualization to drive faster decisions, deliver better customer experiences, and generate market insights from big data. The modern data platform has undergone rapid change, directed primarily at the three Vs of big data: volume, velocity, and variety. But what about data quality? Data quality is often acknowledged as critical yet rarely addressed in a systematic manner. Poor data quality raises uncomfortable questions: How confident are we that decisions based on our analytics are the right ones? Are we blindly jumping to conclusions?
A data quality program is a collection of initiatives with a common objective: maximizing data quality and minimizing the negative impact of poor-quality data. Successful data quality management focuses on four areas: assessment, collaboration, remediation, and monitoring. Let's take a look at each of those areas in more detail.
To use your data efficiently and effectively, you must first understand what it comprises and how it is structured. This is where a data assessment provides value. Good data profiling tools offer broad connectivity to data sources, interfaces that let business users navigate data insights, workflows for collaborating on data issues, and the ability to quickly create data quality rules.
One of the main goals of a data assessment is to develop data quality rules. A rule can validate a single data element or a relationship between several elements, ranging from a simple entity relationship found in a data model to a complex rule based on business dependencies.
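As a rough illustration, such a rule can be thought of as a named predicate applied to each record. This is a minimal sketch, not a real rule engine; the field names (`customer_id`, `status`, `last_invoice_date`) and the two-year threshold are hypothetical:

```python
from datetime import date

# Each rule is a named predicate over a record dict.
# Field names and thresholds here are illustrative assumptions.
rules = {
    # Single-field rule: the customer ID must be present and non-empty.
    "customer_id_present": lambda r: bool(r.get("customer_id")),
    # Cross-field rule: a "live" customer should have invoiced
    # within the last two years (~730 days).
    "live_customer_recent_invoice": lambda r: (
        r.get("status") != "live"
        or (date.today() - r["last_invoice_date"]).days <= 730
    ),
}

def validate(record):
    """Return the names of the rules the record violates."""
    return [name for name, check in rules.items() if not check(record)]

record = {"customer_id": "", "status": "live",
          "last_invoice_date": date(2015, 1, 1)}
```

Here `validate(record)` flags both rules, since the ID is empty and the last invoice is far older than two years.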
When designed and implemented well, such data quality rules can identify most data discrepancies. Coming up with a comprehensive set of rules is a daunting task and requires an iterative process of analysis and refinement. Below are some of the most common data quality dimensions:
| Dimension | Description |
|---|---|
| Completeness | Empty or default values in fields. |
| Conformity | Incorrect format in a field, e.g. a name prefix in the customer name field or noise around telephone numbers. |
| Consistency | Each column looks fine on its own, but problems appear across columns, e.g. a person coded as a company, a company name coded as a person, or a last invoice date two years ago while the customer status is still "live." |
| Duplicates | IDs are unique, but the other fields make it obvious that the records are similar enough to be potential duplicates. |
| Integrity | Relationship issues, e.g. householding a husband and wife or a father and daughter. |
| Accuracy | Data compared with a reference source, e.g. addresses against the PAF, product names against a dictionary. |

Figure 2. Data quality dimensions
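A few of these dimensions lend themselves to simple programmatic checks. The sketch below, using a hypothetical customer dataset, shows one possible measure each for completeness, conformity, and duplicates (real profiling tools use far richer logic):

```python
import re

# Hypothetical customer records illustrating the dimensions above.
customers = [
    {"id": 1, "name": "Acme Ltd", "phone": "(020) 7946-0958"},
    {"id": 2, "name": "Jane Doe", "phone": ""},
    {"id": 3, "name": "Jane Doe", "phone": ""},
]

def completeness(records, field):
    """Share of records with a non-empty value in the field."""
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def conformity(records, field, pattern):
    """Share of non-empty values matching the expected format."""
    values = [r[field] for r in records if r.get(field)]
    ok = sum(1 for v in values if re.fullmatch(pattern, v))
    return ok / len(values) if values else 1.0

def duplicates(records, *fields):
    """Groups of record IDs sharing identical values on the given fields."""
    seen = {}
    for r in records:
        seen.setdefault(tuple(r[f] for f in fields), []).append(r["id"])
    return [ids for ids in seen.values() if len(ids) > 1]
```

For this sample, phone completeness is one third, and records 2 and 3 group together as potential duplicates on name plus phone despite their distinct IDs.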
Data profiling is simplistic by nature; business context is required to make sense of data elements. This means it is important to communicate your findings to data stewards, analysts, and subject matter experts. Doing so improves everyone's understanding of how the data is used, verifies its business relevance, and helps prioritize critical issues.
These discussions may be tough at times, but they are essential to evaluating the potential ROI of data quality improvements, defining data quality rules, and aligning data quality with acceptable measures.
A cross-disciplinary data governance council should be formed, consisting of data stewards, subject matter experts, and data architects. The goal of this team is to work together continually to improve the data quality of the identified attributes.
Resolving data quality issues requires a combination of data fixing and preventing future issues. Data fixing is usually reactive, performed when issues are uncovered during data loads or in visualizations. Cleansing rules are then often written into reporting/ETL tools or reporting databases, where they are known to only a few developers.
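The kind of cleansing rule that tends to accumulate in ETL code might look like the following sketch. The placeholder values treated as missing are assumptions for illustration:

```python
# A hypothetical cleansing step: strip noise from phone numbers and
# normalize placeholder values to None (missing).
def clean_phone(raw):
    """Keep digits only; treat common placeholder values as missing."""
    if raw is None or raw.strip().upper() in {"", "N/A", "UNKNOWN"}:
        return None
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits or None
```

Because logic like this often lives only inside one pipeline, the same fix may be silently missing everywhere else the data is consumed, which is exactly why prevention matters more than patching.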
Defect prevention is proactive, executed through root cause analysis and process improvements; a data governance framework is often necessary for it to succeed. Data profiling tools should provide the capability to create simple to complex data quality rules to confirm the issues and ensure they can be implemented in the right application after approval from the governance team.
Key features data profiling tools need in order to support a wide range of data quality rules:
- Parsing: Built-in capabilities for decomposing data into its component parts.
- Standardization and cleaning: Built-in capabilities for applying industry or local standards, business rules or knowledge bases to modify data for specific formats.
- Matching, linking and merging: Built-in capabilities for matching, linking and merging related data entries within or across datasets, using a variety of techniques, such as rules, algorithms, metadata and machine learning.
- Address validation/geocoding: Support for location-related data standardization and cleansing.
- Issue resolution and workflow: The process flow and user interface that enables business users to identify, quarantine, assign, escalate and resolve data quality issues.
- Metadata management: The capability to capture, reconcile and interoperate metadata relating to the data quality process.
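To make the first capability concrete, here is a toy parser that decomposes a full-name string into components, in the spirit of the parsing feature above. The prefix list and splitting logic are simplifying assumptions; commercial tools use much richer knowledge bases:

```python
# Recognized name prefixes -- a deliberately small, illustrative set.
PREFIXES = {"mr", "mrs", "ms", "dr"}

def parse_name(full_name):
    """Decompose a full-name string into prefix, first, and last parts."""
    parts = full_name.replace(".", "").split()
    prefix = parts[0] if parts and parts[0].lower() in PREFIXES else None
    rest = parts[1:] if prefix else parts
    return {"prefix": prefix,
            "first": rest[0] if rest else None,
            "last": rest[-1] if len(rest) > 1 else None}
```

Once data is parsed into components like this, the standardization and matching capabilities listed above have well-defined fields to work against.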
Data assessment is an ongoing process: data quality, and the continued validity of the data quality rules themselves, must be monitored periodically. Data profiling tools often provide data quality scorecards, which present aggregate scores by rule or by subject area, score decompositions by data element or data record, error reports, and information about each individual attribute.
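The aggregation behind such a scorecard can be sketched very simply. The rule names, subject areas, and counts below are invented for illustration:

```python
# Hypothetical per-rule results: how many records passed and failed.
results = [
    {"rule": "customer_id_present", "area": "customer", "passed": 980, "failed": 20},
    {"rule": "phone_format_valid",  "area": "customer", "passed": 870, "failed": 130},
    {"rule": "order_total_nonneg",  "area": "orders",   "passed": 999, "failed": 1},
]

def scorecard(rows, key):
    """Pass rate (0-1) aggregated by the given key ('rule' or 'area')."""
    totals = {}
    for r in rows:
        p, f = totals.get(r[key], (0, 0))
        totals[r[key]] = (p + r["passed"], f + r["failed"])
    return {k: p / (p + f) for k, (p, f) in totals.items()}
```

Aggregating by `"area"` here gives the customer area a 92.5% pass rate, which is the kind of roll-up a scorecard would track over time to show whether quality is trending up or down.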
If you would like to define your organization’s data quality program, Intersys would be glad to guide you through your data quality improvement journey.