This Best Practice describes the rationale for matching in real-time along with the concepts and strategies used in planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this process using Informatica’s PowerCenter and Data Quality.
The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from ever being entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds messages on a queue, changes captured from a database, or other common data feeds, taking these records and matching them against existing master data that already exists allows for only the new, unique records to be added.
Performing effective real-time matching involves multiple puzzle pieces.
The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent, conformant, valid data, which really means trusted data. These rules should also be reusable so they can be used with the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.
Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching, there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be compared. For example, when matching consumer data on name and address, it may be sensible to limit candidate pull records to those having the same zip code and the same first letter of the last name, because we can reason that if those elements are different between two records, those two records will not match.
There also may be cases where multiple candidate sets are needed. This would be the case if there are multiple sets of match rules that the two records will be compared against. Adding to the previous example, think of matching on name and address for one set of match rules and name and phone for a second. This would require selecting records from the master that have the same phone number and first letter of the last name.
Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching one to many elements of the input record to each candidate pulled from the master. Once the data is compared each pair of records, one input and one candidate, will have a match score or a series of match scores. Scores below a certain threshold can then be discarded and potential matches can be output or displayed.
The full real-time match process flow includes:
Determining which records from the master should be compared with the incoming record is a critical decision in an effective real-time matching system. For most organizations it is not realistic to match an incoming record to all master records. Consider even a modest customer master data set with one million records; the amount of processing, and thus the wait in real-time would be unacceptable.
Candidate selection for real-time matching is synonymous to grouping or blocking for batch matching. The goal of candidate selection is to select only that subset of the records from the master that are definitively related by a field, part of a field, or combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this key would be constructed and stored in an indexed field within the master table(s) allowing for the quickest retrieval. There are many instances where multiple keys are used to allow for one key to be missing or different, while another pulls in the record as a candidate.
What specific data elements the candidate key should consist of very much depends on the scenario and the match rules. The one common theme with candidate keys is the data elements used should have the highest levels of completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code or a National ID. The table below lists multiple common matching elements and how group keys could be used around the data.
The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For acceptable two to three second response times, candidate record counts should be kept under 5000 records.
The following instructions further explain the steps for building a solution to real-time matching using the Informatica suite. They involve the following applications:
Scenario:
Solution:
3. Now that the customer master has been loaded, a Web Service mapping must be created to handle real-time matching. For this project, assume that the incoming record will include a full name field, address, city, state, zip, and a phone number. All fields will be free-form text. Since we are providing the Service, we will be using a Web Service Provider source and target. Follow these steps to build the source and target definitions.
IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1,
IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1,
IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1,
IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1,
IN_POSTAL_CODE_1,IN_GROUP_KEY_2,IN_FIRSTNAME_2,
IN_FIRSTNAME_ALT_2,IN_MIDNAME_2,IN_LASTNAME_2,
IN_POSTNAME_2,IN_HOUSE_NUM_2,IN_STREET_NAME_2,
IN_DIRECTIONAL_2,IN_ADDRESS2_2,IN_CITY_2,IN_STATE_2,
IN_POSTAL_CODE_2
5. The mapping will consist of :
The output of the SQL transform will be the incoming customer record along with the candidate record. These will be stacked records where the Input/Output fields will represent the input record and the Output only fields will represent the Candidate record. A simple example of this is shown in the table below where a single incoming record will be paired with two candidate records:
7. The web service is ready for testing