Challenge
Effective match rules are central to a successful MDM implementation. The challenge is in arriving at the optimal match rules for a given implementation. This Best Practice provides guidelines on how to do that.
Description
This Best Practice describes the general steps in the set up and tuning of match rules. These steps are iterative and repeatable depending on changes to match requirements.
Data Audit
Data Audit should overlap with data discovery. The data analysis may be performed exclusively for the purpose of the match process but would normally be part of the over-all data analysis.
Get a Reasonable Sized Representative Sample
It is essential that the sample data is representative of the data that the production system will contain. Be wary of data coming from TEST systems; it is often developer created and not representative of real data.
Note: A “reasonable sized” sample should be as large and comprehensive as the user can afford.
Discuss with Users/Business to Determine High-Level Match Requirements/Rules
It is necessary to discover (from a source as close as possible to the “acceptance” users) what is considered “MATCHING.” Consider the following questions:
- What is considered a Match?
- What is not considered a Match?
- Are there any samples from prior systems that can be shared?
Data Investigation
Discover fields that will contribute to the match process (including the match key). Use Informatica's Data Profiler, IIR Edit Rule Wizard or SQL queries to conduct the analysis. Assess the quality of each field and combination of fields.
Group Identification: Run simple group statistics on single or multiple field data. This will allow the assessment of major groupings (e.g., state codes). This should help identify:
- Large group of identical records.
- Levels of exact match duplicates (these feeds into whether to use exact match rules).
- Good candidates for filters. Filters are used as exact match columns to reduce the number of candidates sent to match. A lack of filters may result in a large group of candidates, which will negatively impact performance.
- Viability of the intended key column.
Data Quality
Data Completeness: Decide upon the degree to which the data columns are complete; in other words, how often the field has a non-null value. Look for data completeness by column and combination of columns. For example:
- Determine the total number of records that have both first and last names valued.
- If the postal code field is valued in only 50% of the records, it may not be a good candidate as an exact column.
Data Accuracy: Ensure that the data is accurate. For example, verify that the gender field only contains gender values.
- Use pattern analysis to assess the quality of data in column. Pattern analysis is useful in analyzing data that conforms to certain formats or data types (e.g., postal codes).
- nnnnn-nnnn - 80% (full zip code with hyphen)
- nnnnn - 10 % (five-digit zip code)
Suspect Data: Look for extraneous data; these are strings that may need to be removed. This can be detected by browsing the data and/or using Informatica Data Profiler.
Determine the Match Population
Determine if the data will be supported by a standard population (e.g., US, International) or need a customized population. If there is mixed data from different languages, consider using multiple populations (e.g., USA, Japan, Chinese, etc.).
Word Frequency
Use the IIR edit wizard to retrieve a frequency list using sample data. This will:
- List all the word/phrases in the sample data. This provides a frequency list of individual words/phrases within a data field.
- Provide frequency counts.
- Inform the user if there is an IIR rule for the word/phrase.
These results will help validate the population choice or identify if any tweaking is required.
Data Standardization
Using results from the data audit, set up cleanse functions to standardize the data (e.g., junior becomes JR).
Use an address cleanse tool to standardize and clean the address information.
Define Fuzzy Match Key
Set up the match Key based on a discussion with the business users and data audit. It is recommended to use the following as match keys:
- If data contains organization names or both organization and individual names, use the organization name as the match key.
- If data only contains individual names, use the individual’s name as the match key.
- If the data only contains addresses, use address part1 as the match key.
Define the Key Width
- How big is the target dataset going to be?
- How important is missing a match versus performance?
- Is the data quality generally good?
- The “wider” the key:
- The higher chance of finding a match.
- The lower the overall performance will be.
- The options range from widest to narrowest (Extended, Standard, Limited and Preferred).
- With the sample adjust this setting and determine what is lost and gained.
- Run Generate Match Tokens.
- Run statistics on IIR keys; run group summary statistics on the IIR keys.
- Identify suspects IIR keys.
- Review those keys that have the highest count compared to others, review the actual data and determine if this will cause problems when matching. For example:
- Hot Spots are large groups of identical keys due to identical records.
- The following are areas to explore:
- Is it necessary to cleanse the data?
- Is it necessary to flag the suspect records for later match?
- Is it necessary to change the key?
- Is it necessary to add more to the current key?
- Is it necessary to change the key width or type?
- Is the strip table too big?
- What about “Match Only Once”?
Set Up Match Rules
- Set up draft match rules based on a discussion with the business users and data audit.
- When creating match rules, start with the rules that will provide the tightest matches; these will usually be the exact match columns (such as identifiers) with identifiers. Use name/address for added accuracy.
- Define the Match levels, Search levels and Match purposes.
- Start with narrow followed by typical and then extreme. Aim to have the most conservative matching that the users find acceptable; the “cost” of a false positive is generally less than a false negative.
Name and Address Dry Run
It is advised to run the name and address fuzzy matches before using match rules that include unique identifiers. This will determine the match levels based on name and address exclusively.
Use all match levels: conservative, typical and loose. Perform this in the dry run to find comparable numbers of the different level of matching; if time is available, avoid defining three separate rules and rerun the job three times, changing the match level with each iteration.
Run the match rules (without merge and BMG).
Review Name and Address Match
Run a query on the MTCH table and group by match rule. These results should provide an initial perspective of where the match falls and which match levels to use. For example, if the majority of the matches are based on a conservative match level, then it is recommended to use conservative.
Run a quick stat of the match results by match rule. This stat should provide a rough estimate of the match results by match rule. It is recommended to exclude or revise those rules that have little gained matches
Review detailed match results. This could be accomplished by querying the MTCH table. It is useful to join this query with the BO to allow an easy comparison of the results.
Make a copy of the MTCH table. It is recommended to compare the results to subsequent match results using different match rules.
This review should clarify the following aspects:
- Over and under matching.
- In reviewing detailed matched records, it is best to save the detailed results to a table or export to an Excel Spreadsheet. This allows a re-review if required or comparison with other runs.
- Avoid reviewing matched records that have the exact name and address. Concentrate on matched records with discrepancies.
- It helps to categorize the review records. For example, one category is “all records” where the addresses are the same but having one or more discrepancies in the name.
- While reviewing the match results, use IIR workbench to help analyze the match results. Using this tool will help determine why IIR matched or did not match records. By tweaking the rules on the IIR workbench, it is possible to determine what adjustments are necessary for the match rules.
Set Final Match Rules with Exact Columns
Once the composition of the name and address match rules have been decided, set up the final match rules. At this point, add the unique identifiers, such as IMS number.
Include an exact column in the match rules, such as zip code, if there is no unique identifier. Add several identical rules by varying their exact filter.
If there is no reason to differentiate between a tight or loose match, only include the loose match rules in the final set.
Use the options, such as “Match Only Once,” “Dynamic Match Analysis” or “Match only Previous,” especially for the initial load. Toward the end of the review process, set these options on and compare the MTCH table to the earlier iterations to see what matches (if any) are missing. Determine if these acceptable for the performance trade off.
Review Match Results
Run the complete match set. Review the results and try to get acceptance user involvement as early as possible; it is recommended to receive step-by-step assistance from the acceptance user. If there are issues, reiterate the stages
Review STRP Table
Use console Histogram to check for large key sets. It is necessary to analyze the STRP table to look for:
- IIR Key ratio: average number of keys per “rowid_object”
- Calculate this using the cleanse server tokenization batch summary report. Key ratios less than ten should be investigated.
- Large IIR keys
- This implies that a particular string has a high level of frequency in the data set. Are there extraneous data that need to be dealt with (e.g., “Estate of” or “Medical”)?
- If there are an excessive number of keys being generated, then investigate the source. Are there a relatively small number of bad records? Are they from a particular source?
Sample Server Log Summary
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Tokenize Summary
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: =============
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Total BO Records read:13156618
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Total Records read :62725732
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Keys generated :310127101
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Average Record Size :115
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Largest Record Size :200863
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Largest Record Rowid :2033307
[2011-09-05 19:50:00,784] [WebContainer : 3] [INFO ] com.siperian.mrm.match.tokenize.Tokenize: Total time used 21.399084 mins
The following are sample SQL queries to analyze the STRP table:
- This query will show how many records there are for the most populous 5,000 keys in the DB:
- select * from (select SSA_KEY, count(*) c1 from c_customer_strp where Data_row=1 group by ssa_key order by c1 desc) where rownum < 5000
- This will show what the top 5,000 record key size is:
- select * from (select rowid_object,count(*) c1 from c_customer_strp where Data_row =1 group by rowid_object order by c1 desc) where rownum < 5000
- This will show what the top 5,000 records STRP table records count is:
- select * from (rowid_object, data_count from c_customer_strp where data_row =1 and preferred_key_ind =1 order by data_count desc) where rownum < 5000
Review Cleanse Server Log
Review the cleanse server log while a Match batch job is running and after it completes.
- Identify “Hot Spots”
- Look for long-running match ranges (i.e., if a range takes long to be processed). For example, if stats keep appearing for a range, then this is a potential hot spot).
- Review Match Summary
- Match processing time, number of records processed, total auto and manual matches
- Top 10 Range Comparisons counts
- Large range comparison counts may indicate a data issue
Check If Matches are Running Too Slow
The following are potential reasons:
- Hot spots or a large number of comparisons
- Slow DB read
- Low thread count
- Slow disk read /write
- Logging set to Debug mode
- Cleanse/Remove bad data
- Set bad data aside (i.e., prevent from matching)
- Use match path filter to reduce number of child records
- Add extraneous data as Noise Words
- Add Noise words to IIR population
- Improve DB and Disk speed
- Increase thread count
- Use Distributed Matching
- If an exact rule has the fuzzy key components in it, convert it to a “filtered rule”
- Better filtering in rules
- Use Match Bo Level Settings
- Dynamic match threshold
- Match only once
- Change Match Key width
Matching in MDM SaaS
Aside from what has been previously mentioned, match tuning in MDM SaaS can vary slightly from that of MDM Hub. Some key differences are in two key areas: the first is how the results of a match run are reviewed and the second is Match training from the Matching done from machine learning.
When running a Match job in MDM SaaS upon completion a match summary is generated. This summary will provide up to 150,000 matches and provided information on items such as what rule the record was matched on, the source of the record, and what rule was the record matched on. The reviewal of this summary should be done hand and hand with the business unit or whoever will be creating the acceptance criteria.
The reviewal process should be done both at a high level, gaining summary statics on the current iteration, as well as at a more granular level, this means going through a set of matches pairs and determining if it is a proper match or not.
After an iteration is reviewed changes can be made directly to the match model and then republished. This will save time as well as keeping a version of the old model.
Do's and Don'ts of Match and Merge Configuration
Consider the following do's and don'ts when configuring match and merge:
- When data is ingressed, ensure that loading transactional data is avoided. Use Cloud Data Integration to filter out transactional data. Highly duplicate transactional data impacts performance negatively by generating excessive number of match comparisons.
- When a threshold-based merge strategy is used, understand the rule outcomes for each scenario.
- When name and address for matching without unique identifiers is used, ensure that tight thresholds are configured to avoid false positives. For example, multiple family members can share the same address and the records can't be definite matches.
- When exact match rules are used, configure the appropriate fields for the candidate selection criteria.
- If required, regenerate the match keys after updating the candidate selection criteria. For example, if the default population is changed, the match criteria values in a declarative rule change. Ensure that the match keys for the records are regenerated.
- Use the NOT_READY_FOR_MATCH flag to mark records to exclude from the match process. If a source record does not have enough information to contribute to the master record, mark it as not ready for match.
- During match tuning, perform match process first. The records can be rematched until an optimal match configuration is reached.
- During initial match tuning, use a smaller data set to test. Processing a smaller data set minimizes loss of time in case of match job failure and allows for multiple tuning iterations. Ensure that match rules are configured to accommodate expected data volumes.
- Avoid choosing multiple candidate selection criteria unless it is essential and is a requirement for your use case.
- Avoid manual match rules that use loose match level. Don’t flood the data stewards with too many matches.
Match and Merge Best Practices Checklist
The following checklist summarizes the best practices for configuring match and merge:
- Know your data.
- Define candidate selection criteria.
- Set the maximum candidate limit.
- Define fuzzy match fields.
- Define exact match fields.
- Set the merge threshold limits.
- Leverage segment matching.