Last Updated Date: May 25, 2021

Configuring and Tuning Match Rules

This article describes how to use and tune match rules. It is recommended reading for all implementers and for Informatica MDM Hub administrators.

About Matching

Matching is how Informatica MDM Hub identifies data duplicates.

Before Defining the Match Rules

Before beginning the process of defining and refining the match rules, it is essential to be familiar with the data. It is necessary to know:

  • how complete the data is. Are base object records sparsely populated, with many fields that are NULL?
  • how clean the data is. Does the quality of the data generate confidence? Is the data that is there relatively accurate? Are there a lot of word and character transpositions?
  • what proportion of the data is likely to be duplicates? Data that has many duplicates is referred to as “matchy.”
  • in the columns to be used for matching, what is the expected variation in the values? This expected variation is called the cardinality.
  • which data is suitable for exact matching, and which is better for fuzzy matching? Fuzzy matching takes variations such as word order into account. Exact matching does not allow for any variations, but it has performance advantages.

Steps in the Match Process     

The match process consists of the following steps:

  1. Generate tokens that encode the data for searching for possible match candidates. To learn more, see “Tokens for Match Keys.”
  2. Search the data for possible match candidates.
  3. Apply the match rules to the search results to return matches.

Informatica MDM Hub uses the parameters the user set for the match to generate a score that describes the degree to which rows match. Select a range that defines what constitutes a match. If the score is within the range selected, then those rows are returned as matches. Select the range by setting the match level.
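
As an illustration of how the match level acts as a score cutoff, the following sketch uses hypothetical level names and thresholds; they are not the actual values used by Informatica MDM Hub.

# Hypothetical sketch: level names and thresholds are examples only.
MATCH_LEVEL_THRESHOLD = {
    "conservative": 90,  # only very strong matches are accepted
    "typical": 80,
    "loose": 70,         # weaker matches are also accepted
}

def is_match(score, match_level):
    # A row pair is returned as a match when its score falls in the accepted range.
    return score >= MATCH_LEVEL_THRESHOLD[match_level]

print(is_match(85, "typical"))        # True
print(is_match(85, "conservative"))   # False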

Populations     

Informatica MDM Hub uses the concept of populations to encapsulate intelligence about customer name and address data for particular geographic groups. For example, different countries use different formats for addresses. These differences include such things as the placement of the street number and street name, location of the postal code, and other variations in addresses. In addition, different populations have different distributions of surnames. For example, US name data typically has Smith as 1 percent of the surnames. Other populations have other distributions. Informatica MDM Hub uses this intelligence to more effectively match name and address data.

Tokens for Match Keys

A token (also called a match key) is a fixed-length compressed and encoded value built from a combination of the words and numbers in a name or address such that relevant variations have the same key value. For one name or address, multiple match keys are generated. The number of keys generated per base object record varies, depending on the data and the match key level.

Informatica MDM Hub fuzzy matching uses tokens as a basis for searching for potential matches. Tokens allow Informatica MDM Hub to match rows with a degree of fuzziness; the match need not be exact to be considered a match. The process of generating tokens is called tokenization. Before using fuzzy matching, these tokens must be generated.

For example, the following strings generate the following tokens:

String          Token
BETH O'BRIEN    MMU$?/$-
BETH O'BRIEN    PCOG$$$$
BETH O'BRIEN    VL/IEFLM
LIZ O'BRIEN     PCOG$$$$
LIZ O'BRIEN     SXOG$$$-
LIZ O'BRIEN     VL/IEFLM

Note: The tokens that are generated depend on the data and the parameters set for match keys.

When searching for match candidates, LIZ O'BRIEN and BETH O'BRIEN are considered as candidates, because they have some key values in common.
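
A minimal sketch of this candidate lookup, using the token values from the table above. The index structure is illustrative only; the real token table is internal to Informatica MDM Hub.

from collections import defaultdict

# Token table built from the example above; real keys depend on the data and
# the match key settings.
token_table = [
    ("BETH O'BRIEN", "MMU$?/$-"),
    ("BETH O'BRIEN", "PCOG$$$$"),
    ("BETH O'BRIEN", "VL/IEFLM"),
    ("LIZ O'BRIEN", "PCOG$$$$"),
    ("LIZ O'BRIEN", "SXOG$$$-"),
    ("LIZ O'BRIEN", "VL/IEFLM"),
]

# Index records by key so a search can pull every record that shares a key.
records_by_key = defaultdict(set)
for record, key in token_table:
    records_by_key[key].add(record)

def candidates(record):
    keys = {k for r, k in token_table if r == record}
    found = set()
    for k in keys:
        found |= records_by_key[k]
    found.discard(record)
    return found

print(candidates("LIZ O'BRIEN"))  # {"BETH O'BRIEN"}: shared keys PCOG$$$$ and VL/IEFLM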

With respect to tokens, Informatica MDM Hub does several things during match:

  • Informatica MDM Hub checks for the tokenization incomplete indicator. If the last tokenization process started but didn't finish, this indicator is set. If the tokenization incomplete indicator is set, Informatica MDM Hub re-tokenizes the data before matching.
  • Informatica MDM Hub checks the dirty indicator, which shows that an update occurred after the last time the data was tokenized. The dirty indicator can propagate from a child to a parent record. A value of 0 in the dirty indicator means that the record in the token table is up to date. If the record is not up to date, Informatica MDM Hub tokenizes the data before matching.
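
A minimal sketch of these pre-match checks. The indicator names and values here are illustrative; the actual columns are internal to Informatica MDM Hub.

def needs_retokenization(tokenization_incomplete, dirty_indicator):
    if tokenization_incomplete:
        return True              # the last tokenization started but did not finish
    return dirty_indicator != 0  # 0 means the token table record is up to date

# A record updated after its last tokenization (dirty) is re-tokenized before matching.
print(needs_retokenization(False, 1))  # True
print(needs_retokenization(False, 0))  # False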

After generating the tokens, the next step in the match process is to get match candidates from the database using the keys defined for the names or addresses. This is done using the match keys generated on the column(s) selected to form the match key.

Determining When to Tokenize the Data

Tokenization of data can occur at any of these times:

  • when it is loaded
  • when it is put into the table (using the PUT or CLEANSE_PUT API calls)
  • right before matching

The default setting is to not tokenize when the data is loaded or put. Changing this setting can be useful in some implementations, but note the following restriction:

Do not use the Generate Match Tokens on Put option if using the API. If this parameter is turned on, the PUT and CLEANSE_PUT API calls will fail; use the TOKENIZE verb instead. Only turn on Generate Match Tokens on Put when not using the API, or when data steward updates from the console need to be tokenized immediately.

Match Key Widths   

Informatica MDM Hub supports the following key widths:

  • Standard Keys
  • Extended Keys
  • Limited Keys
  • Preferred Keys

These widths represent tradeoffs between the match precision and the space used by the tokens. The space used is determined by the number of tokens generated.

For typical customer data, use the standard key width. The number of tokens can vary based on the data, but the standard key width generally generates approximately five or six token records per base object record.

Extended keys support more variation in the values for the key, but also generate more records in the token table, about 10 to 12 token records for every base object record.

Limited keys support less variation in the values used for the key, but the token table is also much smaller, with perhaps two to three token records per base object record. If the data has character transpositions in the data used for the key, limited keys may not be the best choice.

Preferred keys generate a single key per base record. This reduces the number of comparisons and increases performance but can result in returning fewer matches than other key width options. Use this option if there are high volumes of high quality data.
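
As a rough, hypothetical sizing exercise based on the per-record token counts quoted above (actual counts depend entirely on the data):

# Approximate per-record token counts from the text above; actual counts vary.
base_object_records = 10_000_000  # hypothetical volume

tokens_per_record = {
    "standard": (5, 6),
    "extended": (10, 12),
    "limited": (2, 3),
    "preferred": (1, 1),
}

for width, (low, high) in tokens_per_record.items():
    print(f"{width:9s}: {base_object_records * low:,} to {base_object_records * high:,} token records")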

Match Key Types and Mixed Data

The match key type selected has a big effect on the match results.

For Party objects that include organizations and people in the same object, the match key type must be Organization_Name, and it must be based on the full name column from the Party object. The full name field must be populated for all records, and for individuals it should include at least the first name, middle name, and last name.

Search Strategies

The search strategy determines how many candidates are returned in the search phase of the match process. The number of candidates has a direct effect on the number of matches returned and the amount of time it takes Informatica MDM Hub to apply the match rules.

The search strategy used to determine the set of candidates for matching must find the balance between finding all possible candidates and not slowing the process with too many irrelevant candidates.

Applications dealing with relatively clean and complete data can use a high performance strategy, while applications dealing with less clean data or with more critical duplication issues must use more complex strategies.

To achieve this, four search strategies or search levels are supported:

  • Narrow
  • Typical
  • Exhaustive
  • Extreme

Narrow gives the best performance but supports the least complexity, as it generates the fewest candidates. Extreme supports the highest level of complexity but gives the worst performance, as it generates the most candidates.

For typical customer data, a search strategy of typical is usually appropriate. It may be helpful to change this to narrow for very large data volumes or highly “matchy” data. Alternatively, if there is a small data set or if it is critical that the highest possible number of matching records be identified, use the exhaustive or extreme search levels instead.

If both performance and completeness of match are critical, consider a two-phase approach to the match process: in the first phase, use a narrow search level to quickly match and merge highly similar records, then switch to a different rule set that uses the exhaustive or extreme search levels to provide the more complex and complete searches for candidates.

Match Purposes

The match purpose describes the overall goal of a match rule. The match purpose is very important because it determines which columns are used for matching. The list of match purposes available is determined by the population selected.

Each match purpose supports a combination of mandatory and optional fields and each field is weighted according to its influence in the match decision. Some fields in some purposes may be grouped. There are two types of groupings:

  • Required - requires at least one of the field members to be non-null
  • Best of - contributes only the best score from the fields in the group to the overall match score

For example, in the Individual match purpose:

  • Person_Name is a mandatory field
  • One of either ID Number or Date of Birth is required
  • Other attributes are optional

The overall score returned by each purpose is calculated by adding the participating field scores, each multiplied by its respective weight, and dividing by the total of all participating field weights. If a field is optional and is not provided, it is not included in the weight calculation.
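
A minimal sketch of this weighted calculation, using hypothetical field scores and weights (real weights are defined by the selected match purpose):

def purpose_score(field_scores, field_weights):
    # Only fields that were actually provided count toward the total weight.
    participating = [f for f, s in field_scores.items() if s is not None]
    total_weight = sum(field_weights[f] for f in participating)
    weighted_sum = sum(field_scores[f] * field_weights[f] for f in participating)
    return weighted_sum / total_weight

# Hypothetical scores and weights; Date_of_Birth is optional and not provided.
scores = {"Person_Name": 92, "Address_Part_1": 80, "Date_of_Birth": None}
weights = {"Person_Name": 10, "Address_Part_1": 5, "Date_of_Birth": 5}
print(purpose_score(scores, weights))  # (92*10 + 80*5) / 15 = 88.0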

Using the Match Purposes to Match People

When matching people, if the match includes the address fields, then the Resident purpose is better than the Individual purpose. However, in order to match on person and external IDs, do not use Resident because it requires the address information. In that case, use Individual.

Using the Match Purposes to Match Organizations

When matching organizations, the Division purpose is better than the Organization purpose. Organization allows organizations without addresses to match with organizations with addresses, which may not be desired. Division only matches records with similar addresses.

Using the Match Purposes to Match Addresses

For match purposes that use address, do not use Address_Part_2 on its own without Address_Part_1. If zip or city matches are the only option, add an exact match column on zip or city; using Address_Part_2 alone gives a very loose match. Alternatively, it is possible to add a column using Postal_Area instead of an exact match on city or zip.

Name Formats

Informatica MDM Hub match has the concept of a default name format which tells it where to expect the last name. The options are:

  • Left: the last name is at the start of the full name, for example, Smith Jim
  • Right: the last name is at the end of the full name, for example, Jim Smith

The name format used by Informatica MDM Hub depends on the purpose being used. If using Organization, then the default is Last name, First name, Middle name. If using Person/Resident, then the default is First Middle Last.

Be aware that when formatting data for matching, there are edge cases where explicit formatting helps, particularly for names that do not fall within the selected population.

Field Types Used in Purposes

For details on the field names available for a fuzzy match column, please see the Informatica MDM Configuration Guide, chapter “Configuring the Match Process”, section Configuring Match Columns.

Match Levels

In conjunction with the match purposes, choose one of three different match levels.

Defining and Testing the Match Rules

When defining the match rules, keep the following points in mind:

  • Identify records with large numbers of similar values in the match key field. This is called “matchy” data. Determine whether those records should be considered for matching. If they shouldn't be considered for matching, then flag those records as consolidated before running the match. If they should be considered for matching, then determine whether the Match for Duplicate Data functionality can be used to quickly match and merge the records.
    • Examples of such records are the health-care customers named 'GROUP PRACTICES', 'MULTIPLE DOCTORS' and the zip-aligned customers where the names are all the same with the exception of the zip numbers at the end.
    • Such records all generate the same match keys, resulting in an enormous pool of records for the match to compare against each other, significantly skewing the match data set and negatively affecting performance.
  • For a Party base object (one that contains both organizations and individuals), create different rules for organizations and individuals based on a customer type or customer class indicator. For each rule, use a match purpose that is appropriate for the customer type.
  • To do an exact match on an attribute, include that attribute as an exact match column. This makes a significant difference to performance because it acts as a filter on the match. If all the columns are fuzzy match columns, performance will suffer.
    • If suffix (Jr., Sr., II, III, etc.) is important in the match, define it as an exact match column and switch on null matching for that column. If the suffix is only part of the full name used in Informatica MDM Hub matching, the match must compare records that do not have a suffix with records that do.
  • If there is not much variation in the values in the focus column (that is, a low cardinality column), or if there are few misspellings and character transpositions in the data, use an exact match base object instead of a fuzzy match base object. The performance is significantly better. An ideal candidate for an exact match base object is an External Identifier base object that stores identifiers such as social security numbers and license numbers.

About Testing 

For prototyping and testing the rules, use random data that is of both reasonable quality and quantity. Do not build the prototype to search on made-up names in a development database. Fabricated data will not give an accurate picture of how the rules will behave in a production environment. Use a random sample of real data.

Understand the business and performance needs of the match. There is a natural conflict between performance and completeness of search. To balance these conflicting requirements, choose the search level with care. Test the searches using different search levels on real production data.

When making this judgment, consider measures of completeness (the percentage of known matches found) against measures of performance (how long the search transaction or batch job took). Choose the one that best conforms to the business requirements.

When measuring match completeness, it is best to have a known set of expected search results. When measuring performance, in addition to ensuring the actual production volume of data is being searched, also take into account network and machine load overhead.

Matching Best Practices

Keep these considerations in mind when defining and tuning the rules:

  • The more fields and columns given to Informatica MDM Hub, the better; the additional fields help it find better, and sometimes more, matches. These additional fields provide additional context for the match. This context allows Informatica MDM Hub to make decisions about which columns in a match have higher or lower levels of importance in determining the outcome of the match.
  • If adding exact match columns, these columns have a filtering effect. Exact match columns are applied before fuzzy columns are considered. They result in the set of match candidates being reduced to only those records that have the same value in the exact match column as at least one other record in the exact match set.
  • If adding fuzzy match columns, these columns do not have a filtering effect on the match. They do not filter out matches on their own; the match engine evaluates all fuzzy columns before determining whether two records are a match.
  • Avoid breaking composite values down into their constituent parts, as doing so removes much of the context information that Informatica MDM Hub can derive from the way in which the elements in the composite value are defined. For example, pass a person's full name with as much detail in that field as possible - first name, middle name, last name, suffix, etc. - instead of parsing that field out into first name, middle initial, last name etc.
  • Do not filter values out of the data for matching. For example, if parsing the suffix (Jr., Sr., etc.) from the full name to include it in an exact match column, do not remove it from the full name field.

Exact Match Column Properties

For an exact match column, specify properties that alter the standard exact match rule behavior.

Informatica MDM Hub supports the following properties for exact match: Null Match and Segment Match.

Null Match

The standard behavior of Informatica MDM Hub matching is to treat each NULL value as a placeholder for an unknown value. By default, Informatica MDM Hub treats nulls as unequal in a match. Alter this behavior by enabling the Match NULLs property. When enabling NULL Matching, these options are possible:

  • Disabled - Regardless of the other value, nothing matches (nulls are unequal values). This is the default setting.
  • NULL Matches NULL - If the other value is also NULL, it is considered a match. A NULL value is treated as a particular value in its own right; it matches another NULL value, but does not match any other value.
  • NULL Matches Non-NULL - If the other value is not NULL, it is considered a match. A NULL value is treated as missing data, so it matches non-NULL data.


Use null match in cases where a null value does not represent meaningful missing information. For example, null match is sensible for middle name or suffix columns. Conversely, do not use null match in cases where it could produce an incorrect result; for example, null match is generally inappropriate for a first name column.

When null match is enabled for a rule, that rule is rarely applied, since most data has relatively few nulls. Generally speaking, more match rules means more overhead, and higher overhead can affect performance. Typically, if there are ten match columns, enable null match on only one or two of those columns. The best way to determine whether null matching is appropriate is to know the data and test the rules.
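
A minimal sketch of how the three null-match options above behave for a single exact match column (illustrative only, not the product's implementation):

def exact_column_matches(a, b, null_option):
    if a is not None and b is not None:
        return a == b
    # At least one of the two values is NULL from here on.
    if null_option == "disabled":
        return False                    # a NULL never matches anything
    if null_option == "null_matches_null":
        return a is None and b is None  # NULL only matches another NULL
    if null_option == "null_matches_non_null":
        return True                     # NULL is treated as missing data, so it matches
    raise ValueError(null_option)

print(exact_column_matches(None, None, "disabled"))               # False
print(exact_column_matches(None, None, "null_matches_null"))      # True
print(exact_column_matches(None, "Jr.", "null_matches_non_null")) # True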

Note: Null match cannot be enabled on the same column being used for a segment match. See “Segment Match” below for more information.

Segment Match        

Segment matching is useful for cases where the base object contains different classes of information and different match rules need to apply to the different types of data. For example, consider a base object that contains customer information for medical products. This base object contains information for individual doctors, group practices, HMOs, and hospitals. Create a column that indicates the type of information the record contains: individual, group, and so on. Each of these subsets of the records is referred to as a segment.

Note: Segment matching doesn’t support recursive relationships. An example of a recursive relationship is a group practice that is part of a clinic, which, in turn, is part of a hospital.

This example is better suited to the Segment Matches All Data situation. A more common, and more illustrative, example is when organizations and individuals are in the same Customer table, and the goal is to match organizations with organizations and individuals with individuals.

A common scenario where segment matching is useful is a customer base object that contains both individual records and organization records. Match organizations to organizations and individuals to individuals; do not match individuals to organizations. The following rows demonstrate this example of segment matching:

Customer Name     Customer Class
ABC, Inc          O
ABC Company       O
Annette Curtin    I
A Curtin          I

It is possible to create specific match rules for each segment, resulting in different rules for different types of data. It is also possible to specify the name of a segment.

Using Segment Matches All Data

Generally, segments are used to match within subsets of data. For example, suppose there is a column called MATCH_COLUMN_SEGMENT, and its values are “A”, “B”, and “C”. To match within the “B” segment, create a rule that only generates matches when MATCH_COLUMN_SEGMENT = “B”; Informatica MDM Hub then only generates matches against other rows whose segment is also “B”. By turning Segment Matches All Data on, it matches all the rows in the “B” segment against any other segment. To use the sales leads/customer database example below, by choosing Segment Matches All Data, Informatica MDM Hub matches sales leads against everything; if this checkbox is not selected, Informatica MDM Hub only matches sales leads against other sales leads.

A common scenario where segment matching is useful is a base object with customer records, as well as sales lead records. Match leads to customers, but never the other way around. The following rows demonstrate this example of segment matching:

Customer Name   Sales Lead Flag
ABC, Inc        C
AB Inc          C
ABC, Inc        L

If the Segment Matches All Data option is selected, Informatica MDM Hub starts with the records in the specified segment and attempts to find matches for those records in the entire base object. For example, suppose the base object contains both sales leads and customers, indicated by either a C (for customer) or an L (for lead) in the segment column. To match leads against both other leads and customers, choose segment matching and Segment Matches All Data. Matching this data without the Segment Matches All Data option would result in matching leads against customers, but also matching customers against leads, which might result in less reliable data.

Remember that the segment match, and therefore the Segment Matches All Data option, applies to only one match rule. There can be a number of different match rules, some that use the segment and others that do not. Continuing with the leads and customers example, the segment match with Segment Matches All Data defines a looser match rule for the leads segment that allows leads to be loosely matched to other leads, as well as to customers. Customers should not match to customers on that same loose match rule, since that could result in overmatching the customer records. The segment limits the loose match rule to leads only, but does not restrict leads to matching only other leads.
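
A minimal sketch of the candidate-filtering effect, using the leads (L) and customers (C) example above. It only illustrates which record pairs a segment-restricted rule will even consider; it is not how Informatica MDM Hub is implemented.

records = [
    {"name": "ABC, Inc", "segment": "C"},
    {"name": "AB Inc",   "segment": "C"},
    {"name": "ABC, Inc", "segment": "L"},
]

def candidate_pairs(records, rule_segment, matches_all_data=False):
    pairs = []
    for a in records:
        if a["segment"] != rule_segment:
            continue  # the rule only starts from records in its own segment
        for b in records:
            if b is a:
                continue
            if matches_all_data or b["segment"] == rule_segment:
                pairs.append((a["name"], b["name"]))
    return pairs

print(candidate_pairs(records, "L"))                         # []: the lone lead has no other leads to compare
print(candidate_pairs(records, "L", matches_all_data=True))  # the lead is compared against both customers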

Using Matching on Dependent Tables

When there are parent and child objects and there is a desire to match the child objects, it is necessary to include the parent object’s ROWID in all match rules for the child object. If not, data will be lost.

For example, suppose there is a parent table, COMPANY, and a child table, ADDRESS. When matching addresses within a company without including ROWID_COMPANY in all match rules, it is possible to lose a company's address with each merge. The following rows demonstrate this example of matching on a dependent table:

ROWID_COMPANY   Address
12345           100 Main St
54321           100 Main St

If ROWID_COMPANY is not included in all the match rules, these two rows are merged, and a single ROWID_COMPANY remains. If the remaining ROWID is 12345, the company with ROWID 54321 no longer has a record in the ADDRESS child table. This data is lost.
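
A minimal sketch of why the parent ROWID must be part of the child match rule, using the COMPANY and ADDRESS example above (the rule logic is simplified for illustration):

addresses = [
    {"ROWID_COMPANY": "12345", "ADDRESS": "100 Main St"},
    {"ROWID_COMPANY": "54321", "ADDRESS": "100 Main St"},
]

def addresses_match(a, b, include_parent_rowid):
    if a["ADDRESS"] != b["ADDRESS"]:
        return False
    if include_parent_rowid and a["ROWID_COMPANY"] != b["ROWID_COMPANY"]:
        return False  # addresses that belong to different companies never match
    return True

# Without the parent ROWID the two rows match and merge, and one company loses
# its only address record.
print(addresses_match(addresses[0], addresses[1], include_parent_rowid=False))  # True
print(addresses_match(addresses[0], addresses[1], include_parent_rowid=True))   # False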

Setting Match Batch Sizes

The match batch size is the number of records Informatica MDM Hub attempts to match in one group. If the total number of records considered for match and merge exceeds this maximum match batch size, the match process performs the match in cycles. Each cycle is limited to matching the number of records specified by this parameter.

It may seem like a good idea to use a very large match batch size, but the correct size of the match batch depends on the cardinality of the data and the number of matches the rules return.

Using Dynamic Match Analysis Threshold

Dynamic Match Analysis Threshold is a setting on the Match/Merge Setup screen. Dynamic match analysis examines the match process at runtime to determine whether it will take an unacceptably long period of time. The threshold value specifies the maximum acceptable number of comparisons.

The analysis is computed by multiplying the number of records in the base match group by the number of records in the token table that must be compared. If this product is less than the threshold, the match proceeds. If it is greater than the threshold, the match is not done, and a message noting the range for further investigation is written to the log.
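
A minimal sketch of that check, with hypothetical record counts and threshold (the real threshold is whatever is configured on the Match/Merge Setup screen):

def match_group_allowed(base_group_size, token_candidates, threshold):
    comparisons = base_group_size * token_candidates
    # If the product reaches the threshold, the match is skipped and logged instead.
    return comparisons < threshold

print(match_group_allowed(500, 1_000, threshold=1_000_000))    # True  (500,000 comparisons)
print(match_group_allowed(5_000, 1_000, threshold=1_000_000))  # False (5,000,000 comparisons)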

Tuning Match for Performance

One of the primary culprits in poor performance is an excessive number of comparisons. The match process creates a list of match candidates, and it is these candidates that are then compared to determine matches. Match candidates are determined by the values in the match columns. For example, suppose a dataset of pharmacies contains 50,000 instances of BigChain Pharmacy. Each of the 50,000 records may be unique, but unless the set of candidates is reduced, there would be 50,000 candidates, each of which must be compared to determine matches. It is this comparison work that directly affects performance. Controlling the number of match candidates is the key to improving match performance.
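
Rough arithmetic for the BigChain Pharmacy example (a worked illustration, not a product calculation): if all 50,000 records fall into one candidate pool, the number of pairwise comparisons is n * (n - 1) / 2.

# Pairwise comparisons for a single pool of 50,000 match candidates.
n = 50_000
comparisons = n * (n - 1) // 2
print(f"{comparisons:,}")  # 1,249,975,000 comparisons for a single candidate pool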

The performance of the system is a function of many individual things. The following are some basic strategies used to optimize the performance of the matching:

  • The most effective way to improve performance is to know the data. This knowledge enables the user to apply the various strategies for performance optimization and get the best results from the Informatica MDM Hub implementation.
  • All match approaches are tradeoffs between performance and number of matches. Be biased towards undermatching. Undermatching means that some possible matches are missed. Conversely, overmatching means that an excessive number of comparisons are done. This can consume a great deal of processing time and resources, depending on the size of the data set.
  • Exact matches are more efficient than fuzzy matches. When possible, run exact matches to reduce the number of candidate rows before running fuzzy matches. If the data is very “matchy,” build a match rule set that has only exact rules, and run this rule set first to resolve a large number of matches quickly.
  • If there are high volumes of high quality data, use the Preferred key width to improve performance. However, if the quality of the data is lower, this could result in an unacceptable level of undermatching.
  • If the ROWID objects are monotonically increasing, using the Match Only Previous ROWID Objects option in the Match/Merge Setup screen can improve performance. When this option is set, match comparisons are done only downwards with respect to the ROWIDs; that is, row A is matched to row B, but row B is not matched to row A. Setting this option can reduce the number of comparisons by about half (see the sketch after this list). This option is inappropriate in the following cases:
    • Records are inserted out of ROWID order
    • When using the services integration framework with this base object
    • When using user-declared ROWIDs
  • If the data is appropriate for Match Only Previous ROWID Objects, use that and also select Match Only Once. This option means that once record A has been matched with another record, record A is not compared with any other record again. This dramatically reduces the number of comparisons.
  • When testing and tuning the match rules, use the Dynamic Match Analyze Threshold option. To learn more, see “Using Dynamic Match Analysis Threshold” above.
  • Avoid loose manual rules. This only moves the problem to data stewards.
  • If there is a high volume of data:
    • Do not have any unconditional fuzzy match rules.
    • Always have some exact match filters on every rule. These filters reduce the number of candidates for comparison.
    • Create a number of almost identical fuzzy rules with different exact match filters to reduce the number of rows that are compared for the fuzzy match. For example, create rules such as the following:
      • full name and address + exact postcode
      • full name and address + exact state
      • full name and address + exact first two digits of postcode
    • Consider these exact matches when defining the cleansing process. Optimizing the results of the cleanse process to generate good data for the exact matches can significantly improve both performance and the quality of the results. It is easier to make these optimizations when defining the cleanse processes than it is to go back after the data is in the base object and match issues have been found.
    • Never create automerge rules that contain only a single match column.
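
A minimal sketch of the comparison reduction mentioned for Match Only Previous ROWID Objects above (illustrative counts only):

# Each record is compared only against candidates with lower ROWIDs, so
# A vs B is evaluated but B vs A is not.
rowids = [101, 102, 103, 104]

all_pairs = [(a, b) for a in rowids for b in rowids if a != b]
previous_only = [(a, b) for a in rowids for b in rowids if b < a]

print(len(all_pairs), len(previous_only))  # 12 vs 6: roughly half the comparisons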

About Merging

There are two types of merges: automerge, which merges all records queued for automerge, and manual merge, which requires a data steward to use the Merge Manager. These two types of merges are functionally the same.

For all merges, there are two records: the source and the target. When merging A into B, A is the source and B is the target. The only field that is guaranteed to survive the merge is the ROWID of the source.

When the records are merged, which record is the source and which is the target matters. For the purposes of merge, trust on columns does not apply. For non-trusted columns, the source data always survives and the target data is subsumed. The only time the source data does not survive is when the validation rule is 100% downgrade with 0% minimum reserve trust; in this case, the target field prevails.
