• Success
    Manage your Success Plans and Engagements, gain key insights into your implementation journey, and collaborate with your CSMs
    Success
    Accelerate your Purchase to Value engaging with Informatica Architects for Customer Success
  • Communities
    A collaborative platform to connect and grow with like-minded Informaticans across the globe
    Communities
    Connect and collaborate with Informatica experts and champions
    Have a question? Start a Discussion and get immediate answers you are looking for
    Customer-organized groups that meet online and in-person. Join today to network, share ideas, and get tips on how to get the most out of Informatica
  • Knowledge Center
    Troubleshooting documents, product guides, how to videos, best practices, and more
    Knowledge Center
    One-stop self-service portal for solutions, FAQs, Whitepapers, How Tos, Videos, and more
    Video channel for step-by-step instructions to use our products, best practices, troubleshooting tips, and much more
    Information library of the latest product documents
    Best practices and use cases from the Implementation team
  • Learn
    Rich resources to help you leverage full capabilities of our products
    Learn
    Role-based training programs for the best ROI
    Get certified on Informatica products. Free, Foundation, or Professional
    Free and unlimited modules based on your expertise level and journey
    Self-guided, intuitive experience platform for outcome-focused product capabilities and use cases
  • Resources
    Library of content to help you leverage the best of Informatica products
    Resources
    Most popular webinars on product architecture, best practices, and more
    Product Availability Matrix statements of Informatica products
    Monthly support newsletter
    Informatica Support Guide and Statements, Quick Start Guides, and Cloud Product Description Schedule
    End of Life statements of Informatica products
Last Updated Date May 26, 2021 |

Challenge

Data warehousing incorporates very large volumes of data. The process of loading the warehouse in a reasonable timescale without compromising its functionality is extremely difficult. The goal is to create a load strategy that can minimize downtime for the warehouse and allow quick and robust data management.

 

Description

As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.

 

Incremental Aggregation

Incremental aggregation is useful for applying incrementally-captured changes in the source to aggregate calculations in a session.

 

If the source changes only incrementally, and you can capture those changes, you can configure the session to process only those changes with each run. This allows the PowerCenter Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

 

If the session performs incremental aggregation, the PowerCenter Integration Service saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Integration Service uses this historical information to perform the incremental aggregation.  To utilize this functionality set the “Incremental Aggregation” Session attribute. For details see Chapter 24 in the Workflow Administration Guide.

 

Use incremental aggregation under the following conditions:

  • Your mapping includes an aggregate function.
  • The source changes only incrementally.
  • You can capture incremental changes (i.e., by filtering source data by timestamp).
  • You get only delta records (i.e., you may have implemented the CDC (Change Data Capture) feature of PowerExchange).
 

Do not use incremental aggregation in the following circumstances:

 

  • You cannot capture new source data.
  • Processing the incrementally-changed source significantly changes the target. If processing the incrementally-changed source alters more than half the existing target, the session may not benefit from using incremental aggregation.
  • Your mapping contains percentile or median functions.
 

Some conditions that may help in making a decision on an incremental strategy include:

 

  • Error handling, loading and unloading strategies for recovering, reloading, and unloading data.
  • History tracking requirements for keeping track of what has been loaded and when
  • Slowly-changing dimensions. Informatica Mapping Wizards are a good start to an incremental load strategy. The Wizards generate generic mappings as a starting point (refer to Chapter 15 in the Designer Guide)

 

Source Analysis

Data sources typically fall into the following possible scenarios:

  • Delta records. Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.
  • Record indicator or flags. Records that include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates, and deletes.
  • Date stamped data. Data is organized by timestamps, and loaded into the warehouse based upon the last processing date or the effective date range.
  • Key values are present. When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.
  • No key values present. When no key values are present, surrogate keys are created and all data is inserted into the warehouse based upon validity of the records.

 

Identify Records for Comparison

After the sources are identified, you need to determine which records need to be entered into the warehouse and how. Here are some considerations:

  • Compare with the target table. When source delta loads are received, determine if the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified, or should be archived. If the record does not exist in the target, insert the record as a new row. If it does exist, determine if the record needs to be updated, inserted as a new record, or removed (deleted from target) or filtered out and not added to the target.
  • Record  indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for update or delete scenarios, or does not exist for successful inserts. Some design effort may be needed to manage errors in these situations.

 

Determine Method of Comparison

There are four main strategies in mapping design that can be used as a method of comparison:

  • Joins of sources to targets. Records are directly joined to the target using Source Qualifier join conditions or using Joiner transformations after the Source Qualifiers (for heterogeneous sources). When using Joiner transformations, take care to ensure the data volumes are manageable and that the smaller of the two datasets is configured as the Master side of the join.
  • Lookup on target. Using the Lookup transformation, lookup the keys or critical columns in the target relational database. Consider the caches and indexing possibilities.
  • Load table log. Generate a log table of records that have already been inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine load strategy. Another example is to store the dates associated with the data already loaded into a log table.
  • MD5 checksum function. Generate a unique value for each row of data and then compare previous and current unique checksum values to determine whether the record has changed.

 

Source-Based Load Strategies

Complete Incremental Loads in a Single File/Table

The simplest method for incremental loads is from flat files or a database in which all records are going to be loaded. This strategy requires bulk loads into the warehouse with no overhead on processing of the sources or sorting the source records.

 

Data can be loaded directly from the source locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.

 

Date-Stamped Data

This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or data that has an effective key greater than the last key processed.

 

With the use of relational sources, the records can be selected based on this effective date and only those records past a certain date are loaded into the warehouse. Views can also be created to perform the selection criteria. This way, the processing does not have to be incorporated into the mappings but is kept on the source component.

 

Placing the load strategy into the other mapping components is more flexible and controllable by the Data Integration developers and the associated metadata.

 

To compare the effective dates, you can use mapping variables to provide the previous date processed (see the description below). An alternative to Repository-maintained mapping variables is the use of control tables to store the dates and update the control table after each load.

 

Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced keys. A Router transformation or filter can be placed after the Source Qualifier to remove old records.

 

Changed Data Based on Keys or Record Information

Data that is uniquely identified by keys can be sourced according to selection criteria. For example, records that contain primary keys or alternate keys can be used to determine if they have already been entered into the data warehouse. If they exist, you can also check to see if you need to update these records or discard the source record.

 

It may be possible to perform a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to lookup in the target to see if the data exists.

 

Target-Based Load Strategies

  • Loading directly into the target. Loading directly into the target is possible when the data is going to be bulk loaded. The mapping is then responsible for error control, recovery, and update strategy.
  • Load into flat files and bulk load using an external loader. The mapping loads data directly into flat files. You can then invoke an external loader to bulk load the data into the target. This method reduces the load times (with less downtime for the data warehouse) and provides a means of maintaining a history of data being loaded into the target. Typically, this method is only used for updates into the warehouse.
  • Load into a mirror database. The data is loaded into a mirror database to avoid downtime of the active data warehouse. After data has been loaded, the databases are switched, making the mirror the active database and the active the mirror.

 

Using Mapping Variables

You can use a mapping variable to perform incremental loading. By referencing a date-based mapping variable in the Source Qualifier or join condition, it is possible to select only those rows with greater than the previously captured date (i.e., the newly inserted source data). However, the source system must have a reliable date to use.

 

The steps involved in this method are:

Step 1: Create mapping variable

In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet Designer.

 

Click Add and enter the name of the variable (i.e., $$INCREMENT DATE). In this case, make your variable a date/time. For the Aggregation option, select MAX.

 

In the same screen, state your initial value. This date is used during the initial run of the session and as such should represent a date earlier than the earliest desired data.  The date can use any one of these formats:

 

  • MM/DD/RR
  • MM/DD/RR HH24:MI:SS
  • MM/DD/YYYY
  • MM/DD/YYYY HH24:MI:SS

 

Step 2: Reference the mapping variable in the Source Qualifier

The select statement should look like the following:

 

Select * from table A

where

CREATE DATE > date(‘$$INCREMENT_DATE’. ‘MM-DD-YYYY HH24:MI:SS’)

 

Step 3: Refresh the mapping variable for the next session run using an Expression Transformation

 

Use an Expression transformation and the pre-defined variable functions to set and use the mapping variable.

 

In the expression transformation, create a variable port and use the SETMAXVARIABLE variable function to capture the maximum source date selected during each run.

 

SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)

 

CREATE_DATE in this example is the date field from the source that should be used to identify incremental rows.

 

You can use the variables in the following transformations:

  • Expression
  • Filter
  • Router
  • Update Strategy

 

As the session runs, the variable is refreshed with the max date value encountered between the source and variable. So, if one row comes through with 9/1/2004, then the variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is preserved.

 

Note:  This behavior has no effect on the date used in the source qualifier. The initial select always contains the maximum data value encountered during the previous, successful session run.

 

When the mapping completes, the PERSISTENT value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file.

    

The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it.

 

After a successful session run, the PowerCenter Integration Service saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.

 

Using PowerExchange Change Data Capture

PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the identification, extraction, and loading of change records. It supports all key mainframe and midrange database systems, requires no changes to the user application, uses vendor-supplied technology where possible to capture changes, and eliminates the need for programming or the use of triggers. Once PWX CDC collects changes, it places them in a “change stream” for delivery to PowerCenter. Included in the change data is useful control information, such as the transaction type (insert/update/delete) and the transaction timestamp. In addition, the change data can be made available immediately (i.e., in real time) or periodically (i.e., where changes are condensed).

 

The native interface between PowerCenter and PowerExchange is PowerExchange Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change data from the PWX change stream if real-time consumption is needed or from PWX condense files if periodic consumption is required. The changes are applied directly. So if the action flag is “I”, the record is inserted. If the action flag is “U’, the record is updated. If the action flag is “D”, the record is deleted. There is no need for change detection logic in the PowerCenter mapping.

 

In addition, by leveraging “group source” processing, where multiple sources are placed in a single mapping, the PowerCenter session reads the committed changes for multiple sources in a single efficient pass, and in the order they occurred. The changes are then propagated to the targets, and upon session completion, restart tokens (markers) are written out to a PowerCenter file so that the next session run knows the point to extract from.

 

Tips for Using PWX CDC

  • After installing PWX, ensure the PWX Listener is up and running and that connectivity is established to the Listener. For best performance, the Listener should be co-located with the source system.

  • In the PWX Navigator client tool, use metadata to configure data access. This means creating data maps for the non-relational to relational view of mainframe sources (such as IMS and VSAM) and capture registrations for all sources (mainframe, Oracle, DB2, etc.). Registrations define the specific tables and columns desired for change capture. There should be one registration per source. Group the registrations logically, for example, by source database.

  • For an initial test, make changes in the source system to the registered sources. Ensure that the changes are committed.

  • Still working in PWX Navigator (and before using PowerCenter), perform Row Tests to verify the returned change records, including the transaction action flag (the DTL__CAPXACTION column) and the timestamp. Set the required access mode: CAPX for change and CAPXRT for real time. Also, if desired, edit the PWX extraction maps to add the Change Indicator (CI) column. This CI flag (Y or N) allows for field level capture and can be filtered in the PowerCenter mapping.

  • Use PowerCenter to materialize the targets (i.e., to ensure that sources and targets are in sync prior to starting the change capture process). This can be accomplished with a simple pass-through “batch” mapping. This same bulk mapping can be reused for CDC purposes, but only if specific CDC columns are not included, and by changing the session connection/mode.

  • Import the PWX extraction maps into Designer. This requires the PWXPC component. Specify the CDC Datamaps option during the import.

  • Use “group sourcing” to create the CDC mapping by including multiple sources in the mapping. This enhances performance because only one read/connection is made to the PWX Listener and all changes (for the sources in the mapping) are pulled at one time.

  • Keep the CDC mappings simple. There are some limitations; for instance, you cannot use active transformations. In addition, if loading to a staging area, store the transaction types (i.e., insert/update/delete) and the timestamp for subsequent processing downstream. Also, if loading to a staging area, include an Update Strategy transformation in the mapping with DD_INSERT or DD_UPDATE in order to override the default behavior and store the action flags.

  • Set up the Application Connection in Workflow Manager to be used by the CDC session. This requires the PWXPC component. There should be one connection and token file per CDC mapping/session. Set the UOW (unit of work) to a low value for faster commits to the target for real-time sessions. Specify the restart token location and file on the PowerCenter Integration Service (within the infa_shared directory) and specify the location of the PWX Listener.

  • In the CDC session properties, enable session recovery (i.e., set the Recovery Strategy to “Resume from last checkpoint”).

  • Use post-session commands to archive the restart token files for restart/recovery purposes. Also, archive the session logs.

     

     

Table of Contents

Success

Link Copied to Clipboard