Last Updated Date: Aug 27, 2021

Challenge

Before beginning a journey to address and remediate data quality issues, it is important to have a grasp of the size and scope of the problem. Without a good understanding of the level and depth of data quality issues, the project could be drastically underfunded and poorly planned. A data quality audit provides an initial assessment that serves as a key input to planning and to the subsequent work of addressing the issues uncovered.

Description

Data Quality is a key factor in many Data Management projects. The quality of the proposed project source data, in terms of both its structure and content, is a key determinant of the specifics of the business scope and of the success of the project in general.

Problems with the data content must be communicated to senior project personnel as soon as they are discovered. Poor data quality can impede the proper execution of later steps in the project, such as data transformation and load operations, and can also compromise the business's ability to generate a return on the project investment. This is compounded by the fact that most businesses underestimate the extent of their data quality problems. There is little point in undertaking a data warehouse, migration, integration, master data management, artificial intelligence, or business intelligence project if the underlying data is in bad shape.

The Data Quality Audit is designed to analyze representative samples of the source data and discover their data quality characteristics so that these can be articulated to all relevant project personnel. The project leaders can then decide what actions, if any, are necessary to correct data quality issues and ensure that the successful completion of the project is not in jeopardy.

The Data Quality Audit can typically be conducted very quickly, but the actual time required is determined by the starting condition of the data and the success criteria defined at the beginning of the audit. The main steps are as follows:

  • Representative samples of source data from all primary areas are provided to the Data Quality Developer.
  • The Data Quality Developer uses a data analysis tool to determine the quality of the data according to several criteria.
  • The Data Quality Developer generates summary reports on the data and distributes these to the relevant roles for discussion and next steps.
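
The first step above, drawing representative samples from all primary areas, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `stratified_sample`, the `area_of` callback, and the sampling fraction are hypothetical and are not part of any Informatica tool.

```python
import random
from collections import defaultdict


def stratified_sample(records, area_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from each source area.

    `area_of` is a hypothetical callback that maps a record to the
    name of the primary area it came from.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_area = defaultdict(list)
    for record in records:
        by_area[area_of(record)].append(record)

    sample = []
    for area in sorted(by_area):
        group = by_area[area]
        # Take at least one record so that no area is left out.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling each area separately, rather than sampling the pooled data, keeps small source areas from being missed entirely, which is one plausible reading of "representative" here.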

Two important aspects of the audit are (1) the data quality criteria used, and (2) the type of report generated.

Data Quality Criteria

Any number and type of criteria can be defined for data quality. However, there are six standard criteria:

  • Accuracy is concerned with whether the data values in a dataset are correct. It is often determined by comparing the dataset with a reliable reference source, for example, a dictionary file containing product reference data.
  • Completeness is concerned with missing data, that is, fields in the dataset that have been left empty or whose default values have been left unchanged. For example, many data input fields have a default date setting of 01/01/1900. If a record includes 01/01/1900 as a date of birth, it is highly likely that the field was never populated.
  • Conformity is concerned with data values of a similar type that have been entered in a confusing or unusable manner, for example, telephone numbers that inconsistently include or omit area codes.
  • Consistency is concerned with the occurrence of disparate types of data records in a dataset created for a single data type (e.g., the combination of personal and business information in a dataset intended for business data only).
  • Integrity is concerned with the recognition of meaningful associations between records in a dataset. For example, a dataset may contain records for two or more family members in a household, but without any means for the organization to recognize or use this information.
  • Duplication is concerned with data records that duplicate one another's information, that is, with identifying redundant records in the dataset or records with meaningful information in common. For example:
    • A dataset may contain user-entered records for Batch No. 12345 and Batch 12345, where both records describe the same batch.
    • A dataset may contain several records with common surnames and street addresses, indicating that the records refer to a single household; this type of information is relevant to marketing personnel.
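
Three of the criteria above (completeness, conformity, and duplication) lend themselves to simple rule-based checks. The following Python sketch shows one way such checks might look. The sample records, the field names, and the rule that a telephone number with an area code has exactly ten digits are all assumptions made for illustration; a real audit would use the rules appropriate to the data and would normally be run in a dedicated data analysis tool.

```python
import re
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
RECORDS = [
    {"batch": "Batch No. 12345", "dob": "01/01/1900", "phone": "555-0100"},
    {"batch": "Batch 12345", "dob": "04/17/1982", "phone": "(212) 555-0101"},
    {"batch": "Batch 67890", "dob": "", "phone": "212-555-0102"},
]

DEFAULT_DATE = "01/01/1900"  # the unchanged default cited in the text


def is_incomplete(value, defaults=(DEFAULT_DATE,)):
    """Completeness: the field is empty or still holds a default value."""
    return value is None or value.strip() == "" or value.strip() in defaults


def phone_has_area_code(phone):
    """Conformity: assume ten digits means the area code is present."""
    return len(re.sub(r"\D", "", phone)) == 10


def batch_key(batch):
    """Duplication: reduce a batch label to its digits for comparison."""
    return re.sub(r"\D", "", batch)


def audit(records):
    """Count the records that fail each check."""
    key_counts = Counter(batch_key(r["batch"]) for r in records)
    return {
        "records": len(records),
        "incomplete_dob": sum(is_incomplete(r["dob"]) for r in records),
        "phone_missing_area_code": sum(
            not phone_has_area_code(r["phone"]) for r in records
        ),
        # Every record beyond the first with the same key is a duplicate.
        "duplicate_batches": sum(n - 1 for n in key_counts.values() if n > 1),
    }


print(audit(RECORDS))
```

With the sample data, the audit flags two incomplete dates of birth (one default, one empty), one telephone number without an area code, and one duplicate batch record (Batch No. 12345 and Batch 12345).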

This list is not absolute; the characteristics above are sometimes described with other terminology, such as redundancy or timeliness. Every organization’s data needs are different, and the prevalence and relative priority of data quality issues differ from one organization and one project to the next. Note that the accuracy factor differs from the other five factors in the following respect: whereas a pair of duplicate records, for example, may be visible to the naked eye, it can be difficult to tell simply by eyeballing a data record whether it is inaccurate. Accuracy can be determined by applying fuzzy logic to the data or by validating the records against a verified reference data set.
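
As a rough illustration of validating records against a verified reference data set, the sketch below scores each value against a small reference list using approximate string matching from Python's standard `difflib` module. This is a stand-in for the fuzzy logic mentioned above, not the matching method of any particular product; the reference entries, the `classify` function, and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical reference dictionary of verified product names.
REFERENCE = ["Widget Standard", "Widget Deluxe", "Gadget Pro"]


def best_match(value, reference):
    """Return (closest reference entry, similarity ratio between 0 and 1)."""
    scored = [
        (SequenceMatcher(None, value.lower(), ref.lower()).ratio(), ref)
        for ref in reference
    ]
    score, ref = max(scored)
    return ref, score


def classify(value, reference, threshold=0.85):
    """Label a value as exact, probable, or unverified.

    Comparison ignores letter case; the threshold is an assumed cut-off.
    """
    ref, score = best_match(value, reference)
    if score == 1.0:
        return "exact", ref
    if score >= threshold:
        return "probable", ref
    return "unverified", None


for value in ["Widget Deluxe", "Widgit Deluxe", "Sprocket X"]:
    print(value, classify(value, REFERENCE))
```

Values labeled "probable" are candidates for manual review or correction, while "unverified" values have no sufficiently close reference entry. The appropriate threshold depends on the data and should be tuned against records whose accuracy is already known.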
