Data warehouses incorporate very large volumes of data, and loading the warehouse within a reasonable time window without compromising its functionality is extremely difficult. The goal is to create a load strategy that minimizes warehouse downtime and allows quick, robust data management.
As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.
Incremental aggregation is useful for applying incrementally-captured changes in the source to aggregate calculations in a session.
If the source changes only incrementally, and you can capture those changes, you can configure the session to process only those changes with each run. This allows the PowerCenter Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.
If the session performs incremental aggregation, the PowerCenter Integration Service saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Integration Service uses this historical information to perform the incremental aggregation. To use this functionality, set the “Incremental Aggregation” session attribute. For details, see Chapter 24 of the Workflow Administration Guide.
Use incremental aggregation under the following conditions:
Do not use incremental aggregation in the following circumstances:
Conditions that can help in deciding on an incremental strategy include:
Data sources typically fall into the following possible scenarios:
After the sources are identified, you need to determine which records need to be entered into the warehouse and how. Here are some considerations:
There are four main strategies in mapping design that can be used as a method of comparison:
The simplest method for incremental loads is from flat files or a database from which all records are to be loaded. This strategy uses bulk loads into the warehouse, with no overhead for processing or sorting the source records.
Data can be loaded directly from the source locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.
This method involves data that has been stamped with effective dates or sequences. The incremental load can be determined by dates greater than the previous load date, or by sequence keys greater than the last key processed.
With the use of relational sources, the records can be selected based on this effective date and only those records past a certain date are loaded into the warehouse. Views can also be created to perform the selection criteria. This way, the processing does not have to be incorporated into the mappings but is kept on the source component.
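For example, a source-side view similar to the following sketch (assuming Oracle syntax, a daily load, and hypothetical object names) keeps the date filter out of the mapping entirely:
CREATE VIEW v_orders_incremental AS
SELECT *
FROM   orders
WHERE  effective_date >= TRUNC(SYSDATE) - 1;   -- only rows stamped since the previous day
The mapping then reads from v_orders_incremental as if it were the full source table.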
Placing the load strategy in the other mapping components instead is more flexible and gives the data integration developers control over the logic and the associated metadata.
To compare the effective dates, you can use mapping variables to provide the previous date processed (see the description below). An alternative to Repository-maintained mapping variables is the use of control tables to store the dates and update the control table after each load.
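A minimal sketch of the control-table alternative, assuming Oracle syntax and a hypothetical ETL_LOAD_CONTROL table: the extract selects rows newer than the stored date, and a post-load step advances that date.
-- Source Qualifier override: select only rows newer than the last load date
SELECT s.*
FROM   orders s,
       etl_load_control c
WHERE  c.source_name = 'ORDERS'
AND    s.create_date > c.last_load_date;
-- After a successful load, advance the stored date to the maximum date processed
UPDATE etl_load_control
SET    last_load_date = (SELECT MAX(create_date) FROM orders)
WHERE  source_name = 'ORDERS';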
Non-relational data can be filtered as records are loaded, based on the effective dates or sequenced keys. A Router or Filter transformation can be placed after the Source Qualifier to remove old records.
Data that is uniquely identified by keys can be sourced according to selection criteria. For example, primary keys or alternate keys can be used to determine whether a record has already been entered into the data warehouse. If it exists, you can also check whether the record needs to be updated or the source record should be discarded.
It may be possible to perform a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to lookup in the target to see if the data exists.
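As an illustration (table and key names are hypothetical), a Source Qualifier override or source-side view can exclude rows whose keys already exist in the target; rows that do exist can instead be routed to update or reject logic:
SELECT s.*
FROM   src_customer s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dw_customer t
                   WHERE  t.customer_id = s.customer_id);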
You can use a mapping variable to perform incremental loading. By referencing a date-based mapping variable in the Source Qualifier or join condition, it is possible to select only those rows with a date greater than the previously captured date (i.e., the newly inserted source data). However, the source system must have a reliable date to use.
The steps involved in this method are:
In the Mapping Designer, choose Mappings > Parameters and Variables. Or, to create variables for a mapplet, choose Mapplets > Parameters and Variables in the Mapplet Designer.
Click Add and enter the name of the variable (i.e., $$INCREMENT_DATE). In this case, make the variable's datatype date/time. For the Aggregation option, select MAX.
On the same screen, specify the initial value. This date is used during the initial run of the session and, as such, should represent a date earlier than the earliest desired data. The date can use any one of these formats:
The select statement should look like the following:
SELECT * FROM tableA
WHERE
CREATE_DATE > TO_DATE('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')
Use an Expression transformation and the pre-defined variable functions to set and use the mapping variable.
In the expression transformation, create a variable port and use the SETMAXVARIABLE variable function to capture the maximum source date selected during each run.
SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)
CREATE_DATE in this example is the date field from the source that should be used to identify incremental rows.
You can use the variables in the following transformations:
As the session runs, the variable is updated to the maximum of its current value and the source date values encountered. So, if one row comes through with 9/1/2004, the variable takes that value. If all subsequent rows are less than that, 9/1/2004 is preserved.
Note: This behavior has no effect on the date used in the Source Qualifier during the current run. The initial select always uses the maximum date value encountered during the previous successful session run.
When the session completes, the persistent value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file.
The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it.
After a successful session run, the PowerCenter Integration Service saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.
PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the identification, extraction, and loading of change records. It supports all key mainframe and midrange database systems, requires no changes to the user application, uses vendor-supplied technology where possible to capture changes, and eliminates the need for programming or the use of triggers. Once PWX CDC collects changes, it places them in a “change stream” for delivery to PowerCenter. Included in the change data is useful control information, such as the transaction type (insert/update/delete) and the transaction timestamp. In addition, the change data can be made available immediately (i.e., in real time) or periodically (i.e., where changes are condensed).
The native interface between PowerCenter and PowerExchange is PowerExchange Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change data from the PWX change stream if real-time consumption is needed, or from PWX condense files if periodic consumption is required. The changes are applied directly: if the action flag is “I”, the record is inserted; if the action flag is “U”, the record is updated; and if the action flag is “D”, the record is deleted. There is no need for change detection logic in the PowerCenter mapping.
In addition, by leveraging “group source” processing, where multiple sources are placed in a single mapping, the PowerCenter session reads the committed changes for multiple sources in a single efficient pass, and in the order they occurred. The changes are then propagated to the targets, and upon session completion, restart tokens (markers) are written out to a PowerCenter file so that the next session run knows the point to extract from.
After installing PWX, ensure the PWX Listener is up and running and that connectivity is established to the Listener. For best performance, the Listener should be co-located with the source system.
In the PWX Navigator client tool, use metadata to configure data access. This means creating data maps for the non-relational to relational view of mainframe sources (such as IMS and VSAM) and capture registrations for all sources (mainframe, Oracle, DB2, etc.). Registrations define the specific tables and columns desired for change capture. There should be one registration per source. Group the registrations logically, for example, by source database.
For an initial test, make changes in the source system to the registered sources. Ensure that the changes are committed.
Still working in PWX Navigator (and before using PowerCenter), perform Row Tests to verify the returned change records, including the transaction action flag (the DTL__CAPXACTION column) and the timestamp. Set the required access mode: CAPX for change and CAPXRT for real time. Also, if desired, edit the PWX extraction maps to add the Change Indicator (CI) column. This CI flag (Y or N) allows for field level capture and can be filtered in the PowerCenter mapping.
Use PowerCenter to materialize the targets (i.e., to ensure that sources and targets are in sync prior to starting the change capture process). This can be accomplished with a simple pass-through “batch” mapping. This same bulk mapping can be reused for CDC purposes by changing the session connection/mode, but only if CDC-specific columns are not included.
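In relational terms, this initial materialization is simply a one-time bulk copy of the source into the target, as in the sketch below (hypothetical names); in practice, the pass-through PowerCenter mapping performs this step rather than hand-written SQL.
INSERT INTO dw_customer (customer_id, customer_name, last_update_ts)
SELECT customer_id, customer_name, last_update_ts
FROM   src_customer;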
Import the PWX extraction maps into Designer. This requires the PWXPC component. Specify the CDC Datamaps option during the import.
Use “group sourcing” to create the CDC mapping by including multiple sources in the mapping. This enhances performance because only one read/connection is made to the PWX Listener and all changes (for the sources in the mapping) are pulled at one time.
Keep the CDC mappings simple. There are some limitations; for instance, you cannot use active transformations. If loading to a staging area, store the transaction types (i.e., insert/update/delete) and the timestamp for subsequent processing downstream, and include an Update Strategy transformation in the mapping with DD_INSERT or DD_UPDATE in order to override the default behavior and preserve the action flags.
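A sketch of such a staging table, assuming Oracle and hypothetical column names (the action flag comes from DTL__CAPXACTION and the timestamp from the change record's transaction timestamp):
CREATE TABLE stg_customer_cdc (
  customer_id    NUMBER,
  customer_name  VARCHAR2(100),
  cdc_action     CHAR(1),      -- 'I', 'U', or 'D' from DTL__CAPXACTION
  cdc_timestamp  TIMESTAMP     -- transaction timestamp of the captured change
);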
Set up the Application Connection in Workflow Manager to be used by the CDC session. This requires the PWXPC component. There should be one connection and token file per CDC mapping/session. Set the UOW (unit of work) to a low value for faster commits to the target for real-time sessions. Specify the restart token location and file on the PowerCenter Integration Service (within the infa_shared directory) and specify the location of the PWX Listener.
In the CDC session properties, enable session recovery (i.e., set the Recovery Strategy to “Resume from last checkpoint”).
Use post-session commands to archive the restart token files for restart/recovery purposes. Also, archive the session logs.