Following a rigorous methodology is key to delivering customer satisfaction and expanding analytics use cases across the business.
The number one mistake when starting a Data Management project is skipping data profiling. Data discovery and analysis can be laborious and complex, involving massive volumes of data with relationships that are difficult to unravel, but bypassing this crucial step (particularly when embarking on a digital transformation) will often result in project delays and re-work. No matter how well you think you know your data, existing documentation, source code, data models and staff experience are often outdated, incorrect or missing.
Data Profiling is a fundamental activity performed in the early phase of a Data Management project (e.g., Master Data Management, Data Integration, Data Quality, Data Governance). This best practice article primarily applies to Master Data Management (MDM) implementations, although it is tool agnostic for this purpose.
As defined by Gartner, “Data profiling is a technology for discovering and investigating data quality issues, such as duplication, lack of consistency and lack of accuracy and completeness. This is accomplished by analyzing one or multiple data sources and collecting metadata that shows the condition of the data and enables the data steward to investigate the origin of data errors. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.”
Informatica views data profiling as a process for assessing the quality and structure of data sources, so that you have a complete and accurate picture of your data. Data profiling verifies that data columns are populated with the expected types of data. If a profile reveals problems in the data, those data quality concerns can be addressed to correct and prevent data anomalies.
The primary reasons to undertake a data profiling exercise in any data-related IT initiative are outlined below.
Before any data can be accurately integrated in an MDM project, its content, quality, and structure must be understood. Data profiling sets expectations for the end results that incoming source data quality can support. Early diagnosis of poor data quality gives organizations the opportunity to resolve issues while the data can still be made usable for further applications. Early detection means earlier mitigation and a higher chance of success. It also improves business user engagement and promotes good data governance.
A customer wanted to master and consolidate product data from six enterprise systems into MDM, and expected over 90% of the data to be deduplicated at go-live. However, after profiling the sources, it was discovered that many critical data elements needed for matching were missing, and less than 60% of the data could be deduplicated without additional cleansing and enrichment. Because of these findings, the customer was able to reset expectations.
Simple descriptive statistics (i.e., column-level data profiling applied in data quality assessments) support the definition of data quality metrics and scorecards, data integration requirements, and matching / identity resolution rules. Understanding the data, requirements, and rules leads to correct, elegant, and sustainable technical design specifications and the ability to create comprehensive, relevant test cases.
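To make the idea of column-level descriptive statistics concrete, here is a minimal sketch of a column profile in pure Python. The function name, the null markers, and the sample values are illustrative assumptions, not part of any specific profiling tool:

```python
from collections import Counter

def profile_column(values):
    """Column-level profile sketch: completeness, cardinality, top values.

    'values' is a list of raw column values; None, "" and the literal
    string "NULL" are treated as missing (an assumption for this sketch).
    """
    non_null = [v for v in values if v not in (None, "", "NULL")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "completeness_pct": round(100 * len(non_null) / len(values), 1) if values else 0.0,
        "distinct": len(counts),
        "top_values": counts.most_common(3),  # value-frequency ratios
    }

# Hypothetical country-code column with two missing entries
profile = profile_column(["US", "US", "CA", "", None, "US"])
# completeness 66.7%, 2 distinct values, most frequent value ("US", 3)
```

Real profiling tools compute far richer statistics (patterns, inferred types, value distributions), but even a simple profile like this is enough to spot an unexpectedly sparse or low-cardinality column early.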
A customer was confident that its party contact information was complete and of high quality because of checks in place in the onboarding applications. After profiling the data, it was discovered that about 20% of the contact information was invalid.
Data profiling identifies problematic “hot spots” early in the project and triggers development of match mitigation strategies. It also identifies any overlaps in match attributes across sources in support of match rule designs and strategies. The perceived trustworthiness of source data can be validated, leading to timely discussions among key project stakeholders to arrive at the correct trust rules. This ensures that each ‘golden record’ attribute comes from the confirmed source.
Another element of the data profiling analysis is to understand the referential integrity within and across sources (along with other relationships between data structures). If this is not addressed, the project team often faces issues with rejected records, which require additional time and effort to analyze, troubleshoot, and rework code. Assistance may even be required from other non-project teams to modify source data extracts or redevelop source data views.
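A basic cross-source referential integrity check simply looks for child records whose foreign key has no match in the parent set (records that would otherwise be rejected at load time). The sketch below uses hypothetical field names (`account_id`, `customer_id`) purely for illustration:

```python
def orphan_records(child_rows, parent_keys, fk_field):
    """Return child records whose foreign key is absent from the parent
    key set. A sketch of a cross-source referential integrity check."""
    parent = set(parent_keys)
    return [row for row in child_rows if row.get(fk_field) not in parent]

# Hypothetical example: accounts referencing customers from another source
accounts = [
    {"account_id": "A1", "customer_id": "C1"},
    {"account_id": "A2", "customer_id": "C9"},  # C9 does not exist upstream
]
orphans = orphan_records(accounts, ["C1", "C2"], "customer_id")
# one orphan found: account A2 would be rejected at load time
```

Running such a check during profiling surfaces the orphan records before they show up as load rejections, when they are far cheaper to resolve.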
The unique source identifier is a key concept in MDM. If it is not confirmed unique within a source before data is loaded, it can cause unnecessary delays while the unexpected duplication is addressed. Additionally, if an identifier is missing, investigating other columns more closely can identify alternative candidates, or a group of columns that can be combined into a compound unique key that ensures source record uniqueness.
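Confirming uniqueness, and evaluating a candidate compound key when no single column is unique, can be sketched as a simple duplicate count over the candidate key fields. The field names below are hypothetical:

```python
from collections import Counter

def key_duplicates(rows, key_fields):
    """Count rows per candidate key; return only the keys that occur more
    than once. An empty result means the candidate key is unique."""
    keys = [tuple(row[f] for f in key_fields) for row in rows]
    return {key: n for key, n in Counter(keys).items() if n > 1}

# Hypothetical extract where src_id repeats across systems
rows = [
    {"src_id": "1", "system": "ERP"},
    {"src_id": "1", "system": "CRM"},
    {"src_id": "2", "system": "ERP"},
]
dups_single = key_duplicates(rows, ["src_id"])            # src_id alone is not unique
dups_compound = key_duplicates(rows, ["src_id", "system"])  # compound key is unique
```

The same check, run against the full source extract, is how profiling confirms (or refutes) the assumed unique source identifier before any data is loaded.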
A customer in financial services expected to consolidate data based on the Tax ID, on the assumption that the Tax ID was unique. This customer had recently merged with another firm, and the new firm used the Tax ID of the money manager instead of the individual investor whenever an individual investor used a third-party money manager. There were over 35 million accounts, and this scenario impacted fewer than half a million of them. Without data profiling, this scenario would have only come to light during UAT, or worse, after the Production go-live, and it would have impacted data ingestion and consolidation rules.
A lack of unique identifiers can also highlight a potential issue for delta definition and jeopardize required processing SLA times. If a source delta extract (inclusive of deleted, updated, and new records) cannot be provided, or if a delta cannot be derived accurately based on data profiling observations, then full data set processing may be required for each load cycle. This, in turn, increases the time it takes to deliver accurate and complete data to its end consumers. Identifying this early presents another opportunity to address these risks and set expectations for delivery timing.
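When a source cannot supply deltas, one common workaround is to derive them by comparing consecutive full extracts: fingerprint each record, then classify keys as new, updated, or deleted. This is a hedged sketch of that idea, with hypothetical record layouts, not a prescription for any particular tool:

```python
import hashlib

def row_hash(row):
    """Stable fingerprint of a record, built from sorted field/value pairs.
    Assumes all values stringify deterministically (a simplification)."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def derive_delta(previous, current, key="id"):
    """Compare two full extracts by key and return new, updated, and
    deleted record keys. 'key' must be a confirmed unique identifier."""
    prev = {r[key]: row_hash(r) for r in previous}
    curr = {r[key]: row_hash(r) for r in current}
    return {
        "new": [k for k in curr if k not in prev],
        "updated": [k for k in curr if k in prev and curr[k] != prev[k]],
        "deleted": [k for k in prev if k not in curr],
    }

prev = [{"id": "1", "name": "Ann"}, {"id": "2", "name": "Bob"}]
curr = [{"id": "1", "name": "Anne"}, {"id": "3", "name": "Cy"}]
delta = derive_delta(prev, curr)
```

Note that this approach depends on the very thing the preceding paragraphs stress: a confirmed unique key. Without one, even derived deltas are unreliable, which is why profiling the identifier comes first.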
Business leaders can make strategic and trusted decisions by leveraging data profiling results to understand the quality, shape and characteristics of source data. This mitigates risks early-on that can prevent data-related IT projects from moving towards digital transformation. When planning Data Management projects, it is important to not skip the critical step of data profiling.