Performance tuning the loading of relational tables, XML, COBOL, and standard flat files benefits from partitioning. Partitioning alone is not enough, however; an effective strategy coordinates sessions, partitions, and CPUs closely.
On hardware systems that are under-utilized, performance can be improved by processing partitioned data sets in parallel in multiple threads of the same session instance running on the Informatica PowerCenter Integration Service (PCIS) node. Parallel execution may impair performance on over-utilized systems or systems with limited I/O capacity.
In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits of partitioning.
First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:
This step gives high-level information about the bottleneck point or points. To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages provide the following details for the reader, transformation, and writer threads:
If the PowerCenter servers are running in virtual environments, the resources reported by the guest OS may be different from the actual resources, and the actual resources may also vary at runtime. Dynamic allocation of CPUs may mean that resources needed later in the run are not actually available even though they are reported. For virtual environments, consult the administrators to determine the actual resources available and any changes that take place during runtime.
To determine whether the physical CPUs in a non-virtualized environment are under-utilized, check the CPU usage of the PCIS nodes. The ID column displays the percentage of CPU idle time during the specified interval, excluding I/O wait. If CPU cycles are available (i.e., twenty percent or more idle time), this session's performance may be improved by adding a partition.
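As a rough illustration only (not part of the PowerCenter tooling), the following Python sketch samples /proc/stat on a Linux PCIS node and reports the idle percentage over an interval; the Linux host and the sampling interval are assumptions, and the twenty percent threshold simply mirrors the rule of thumb above.

```python
# Minimal sketch: estimate CPU idle % on an assumed Linux host by sampling /proc/stat.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]          # aggregate "cpu" line
    user, nice, system, idle, iowait, *rest = map(int, fields)
    return idle, sum(map(int, fields))             # the idle counter excludes I/O wait

def idle_percent(interval=5):
    idle1, total1 = cpu_times()
    time.sleep(interval)
    idle2, total2 = cpu_times()
    return 100.0 * (idle2 - idle1) / (total2 - total1)

if __name__ == "__main__":
    idle = idle_percent()
    print(f"CPU idle: {idle:.1f}%")
    if idle >= 20.0:
        print("Spare CPU cycles available; adding a partition may help.")
    else:
        print("CPU is busy; adding partitions is unlikely to help.")
```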
To determine the I/O statistics:
If too much memory is allocated to your session, you will receive a memory allocation error. Check that the session uses as much memory as possible. If the session is paging, increase the memory. To determine if the session is paging:
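For illustration only, here is a minimal sketch (assuming a Linux host with /proc/vmstat; this is not the official procedure) that samples the swap-in and swap-out counters while the session runs. Sustained non-zero deltas suggest the session is paging, in which case its memory should be increased.

```python
# Minimal sketch: detect paging (swap activity) on an assumed Linux host via /proc/vmstat.
import time

def swap_counters():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

before = swap_counters()
time.sleep(10)                       # sample while the session is running
after = swap_counters()

swapped_in  = after["pswpin"]  - before["pswpin"]
swapped_out = after["pswpout"] - before["pswpout"]
print(f"pages swapped in: {swapped_in}, out: {swapped_out}")
if swapped_in or swapped_out:
    print("Swap activity detected; the session may be paging -- increase its memory.")
```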
Set up partitions once it has been determined that partitioning is practical.
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. A partition type must be specified for a pipeline. The partition type determines how the PCIS redistributes data across partition points. The Workflow Manager allows the following partition types to be defined:
The PCIS distributes data evenly among all partitions. Use round-robin partitioning to distribute rows evenly when the data does not need to be grouped among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.
In this scenario, Informatica recommends setting a partition point after the Source Qualifier with a partition type of round-robin. The PCIS distributes the data so that each partition processes approximately one third of the data.
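Conceptually, round-robin distribution behaves like the following sketch (illustrative Python, not PowerCenter internals; the row counts are hypothetical): rows are dealt to partitions in turn, so each partition ends up with roughly the same volume regardless of which source file the rows came from.

```python
# Illustrative sketch of round-robin distribution: rows are dealt to
# partitions in turn, so each partition receives roughly equal volume.
from itertools import chain

def round_robin(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

# Three source files of very different sizes (hypothetical row counts).
small, medium, large = range(100), range(1_000), range(10_000)
partitions = round_robin(chain(small, medium, large), 3)
print([len(p) for p in partitions])   # 3,700 rows in each partition
```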
The PCIS applies a hash function to a partition key to group data among partitions.
Use hash partitioning to ensure that the PCIS processes groups of rows with the same partition key in the same partition. For example, consider a scenario where items are sorted by item ID, but the number of items with a particular ID is not known. When the hash auto-keys option is used, the PCIS uses all grouped or sorted ports as the partition key. When the hash user keys option is selected, the ports that make up the partition key are specified manually.
An example of this type of partitioning is an Aggregator transformation, where groups of rows that share a key must be processed in the same partition so that the aggregation is computed correctly.
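The grouping behavior can be pictured with a small sketch (illustrative Python, not the PCIS hash function; the item_id key and row data are hypothetical): hashing the partition key guarantees that every row with the same key lands in the same partition, which is exactly what an Aggregator needs to compute group totals correctly.

```python
# Illustrative sketch of hash partitioning: every row with the same key
# is routed to the same partition, so per-key aggregation stays correct.
from collections import defaultdict
from zlib import crc32

def hash_partition(rows, key, num_partitions):
    partitions = defaultdict(list)
    for row in rows:
        bucket = crc32(str(row[key]).encode()) % num_partitions
        partitions[bucket].append(row)
    return partitions

rows = [
    {"item_id": 101, "qty": 5},
    {"item_id": 202, "qty": 3},
    {"item_id": 101, "qty": 7},   # same item_id -> same partition as the first row
]
for bucket, part in hash_partition(rows, "item_id", 4).items():
    print(bucket, part)
```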
Specify one or more ports to form a compound partition key for a source or target. The PCIS then passes data to each partition depending on the ranges you specified for each port.
Use key range partitioning when the sources or targets in the pipeline are partitioned by key range. Refer to the Workflow Administration Guide for further directions on setting up key range partitions.
For example, with key range partitioning and an end range of 2020, the PCIS passes to that partition data where values are less than 2020. Similarly, with a start range of 2020, the PCIS passes to that partition data where values are greater than or equal to 2020. Null values, or values that do not fall into any range, are passed to the first partition.
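The routing rule above can be sketched as follows (illustrative Python; the YEAR port name is hypothetical, and the 2020 boundary is taken from the example): values below the end range go to the first partition, values at or above the start range go to the second, and NULLs or unmatched values fall back to the first partition.

```python
# Illustrative sketch of key range partitioning on a single port (YEAR).
# Partition 0: values < 2020; partition 1: values >= 2020;
# NULLs or values outside every range fall back to the first partition.
def key_range_partition(rows, key):
    partitions = [[], []]
    for row in rows:
        value = row.get(key)
        if value is not None and value < 2020:
            partitions[0].append(row)
        elif value is not None and value >= 2020:
            partitions[1].append(row)
        else:                      # NULL or otherwise unmatched
            partitions[0].append(row)
    return partitions

rows = [{"YEAR": 2019}, {"YEAR": 2020}, {"YEAR": 2023}, {"YEAR": None}]
low, high = key_range_partition(rows, "YEAR")
print(len(low), len(high))         # 2 rows in the low partition, 2 in the high one
```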
In this type of partitioning, the PCIS passes all rows at one partition point to the next partition point without redistributing them.
Use pass-through partitioning when an additional pipeline stage is needed to improve performance, but the distribution of data across partitions cannot or should not be changed. The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and writer) by default. Each of these threads can process at most one data set at a time, so at most three data sets are processed simultaneously. If there are complex transformations in the mapping, the transformation thread may take longer than the other threads, which can slow data throughput.
It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread.
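The effect of adding a partition point at an expensive transformation can be pictured with a toy pipeline (illustrative Python threads and queues, not the DTM itself; the stage functions are hypothetical): splitting the transformation work into two stages lets two threads overlap their work on different rows while the data stays in the same partition.

```python
# Toy pipeline sketch: a pass-through partition point splits the expensive
# transformation into two stages, so two threads overlap their work on
# different rows without redistributing data across partitions.
import threading, queue

DONE = object()

def stage(work, inbox, outbox):
    while True:
        row = inbox.get()
        if row is DONE:
            outbox.put(DONE)
            break
        outbox.put(work(row))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
# Stage 1 and stage 2 each perform half of the original transformation logic.
t1 = threading.Thread(target=stage, args=(lambda r: r * 2, q1, q2))
t2 = threading.Thread(target=stage, args=(lambda r: r + 1, q2, q3))
t1.start(); t2.start()

for row in range(5):               # reader thread feeding the pipeline
    q1.put(row)
q1.put(DONE)

results = []
while (item := q3.get()) is not DONE:   # writer thread draining the pipeline
    results.append(item)
print(results)                      # [1, 3, 5, 7, 9]
```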
Once these factors have been considered and a partitioning strategy has been selected, the iterative process of adding partitions can begin. Continue adding partitions to the session until the desired performance threshold is reached, the gains from each additional partition diminish, or performance begins to degrade.
After adding a partition, monitor the session to see whether the partition is degrading or improving session performance. If session performance has improved, try adding another partition.
Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for large sessions. For small and medium-sized sessions, it is more practical to distribute whole sessions to different nodes using Workflow on Grid. Session on Grid leverages the existing partitions of a session by executing threads in multiple DTMs. The Log Service can be used to get the cumulative log.
With the dynamic partitioning option, the partition information can be configured so that the Integration Service determines the number of partitions to create at run time.
The Integration Service scales the number of session partitions at run time based on factors such as source database partitions or the number of nodes in a grid.
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. If the volume of data grows or more CPUs are added, partitioning may need to be adjusted so the session runs with optimal performance.
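The idea can be sketched as a simple run-time calculation (illustrative Python with hypothetical method names and inputs; the real Integration Service computes this internally based on the partitioning method selected): the number of partitions is derived from a single scaling factor such as the node count, the CPU count, or the number of source database partitions.

```python
# Illustrative sketch of dynamic partitioning: the partition count is
# derived at run time from a single scaling factor rather than hard-coded.
import os

def dynamic_partition_count(method, grid_nodes=1, db_partitions=1, explicit=1):
    # Hypothetical methods mirroring the idea of scaling on grid nodes, CPUs,
    # source database partitions, or an explicit parameter value.
    if method == "based_on_nodes":
        return grid_nodes
    if method == "based_on_cpus":
        return os.cpu_count() or 1
    if method == "based_on_source_partitioning":
        return db_partitions
    return explicit                    # value supplied via a parameter

print(dynamic_partition_count("based_on_nodes", grid_nodes=4))                    # 4
print(dynamic_partition_count("based_on_source_partitioning", db_partitions=8))   # 8
```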
With the Session on Grid option, more partitions can be added when more resources are available. Also, the number of partitions in a session can be tied to partitions in the database to reduce the maintenance of PowerCenter partitions and take advantage of database partitioning.