Performance tuning the loading of relational tables, XML, COBOL, and standard flat files benefits from partitioning. Partitioning alone is not enough, however; an effective strategy coordinates sessions, partitions, and CPUs closely.
On hardware systems that are under-utilized, performance can be improved by processing partitioned data sets in parallel in multiple threads of the same session instance running on the Informatica PowerCenter Integration Service (PCIS) node. Parallel execution may impair performance on over-utilized systems or systems with limited I/O capacity.
In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits of partitioning.
First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:
This step gives high-level information about the bottleneck point or points. To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages provide the following details for the reader, transformation, and writer threads:
If the PowerCenter servers are running in virtual environments, the resources reported by the guest OS may be different from the actual resources, and the actual resources may also vary at runtime. Dynamic allocation of CPUs may mean that resources needed later in the run are not actually available even though they are reported. For virtual environments, consult the administrators to determine the actual resources available and any changes that take place during runtime.
To determine whether the physical CPUs in a non-virtualized environment are under-utilized, check the CPU usage of the PCIS nodes. The ID column displays the percentage of CPU idle time during the specified interval, excluding I/O wait. If CPU cycles are available (i.e., twenty percent or more idle time), this session's performance may be improved by adding a partition.
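As a rough illustration only (not part of the PowerCenter tooling), the following Python sketch samples /proc/stat on a Linux PCIS node and reports the idle percentage over an interval; the Linux host and the sampling interval are assumptions, and the twenty percent threshold simply mirrors the rule of thumb above.

```python
# Minimal sketch: estimate CPU idle % on an assumed Linux host by sampling /proc/stat.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]          # aggregate "cpu" line
    user, nice, system, idle, iowait, *rest = map(int, fields)
    return idle, sum(map(int, fields))             # the idle counter excludes I/O wait

def idle_percent(interval=5):
    idle1, total1 = cpu_times()
    time.sleep(interval)
    idle2, total2 = cpu_times()
    return 100.0 * (idle2 - idle1) / (total2 - total1)

if __name__ == "__main__":
    idle = idle_percent()
    print(f"CPU idle: {idle:.1f}%")
    if idle >= 20.0:
        print("Spare CPU cycles available; adding a partition may help.")
    else:
        print("CPU is busy; adding partitions is unlikely to help.")
```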
To determine the I/O statistics:
If too much memory is allocated to your session, you will receive a memory allocation error. Check that the session uses as much memory as possible. If the session is paging, increase the memory. To determine if the session is paging:
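For illustration only, here is a minimal sketch (assuming a Linux host with /proc/vmstat; this is not the official procedure) that samples the swap-in and swap-out counters while the session runs. Sustained non-zero deltas suggest the session is paging, in which case its memory should be increased.

```python
# Minimal sketch: detect paging (swap activity) on an assumed Linux host via /proc/vmstat.
import time

def swap_counters():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

before = swap_counters()
time.sleep(10)                       # sample while the session is running
after = swap_counters()

swapped_in  = after["pswpin"]  - before["pswpin"]
swapped_out = after["pswpout"] - before["pswpout"]
print(f"pages swapped in: {swapped_in}, out: {swapped_out}")
if swapped_in or swapped_out:
    print("Swap activity detected; the session may be paging -- increase its memory.")
```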
Set up partitions once it has been determined that partitioning is practical.
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. A partition type must be specified for a pipeline. The partition type determines how the PCIS redistributes data across partition points. The Workflow Manager allows the following partition types to be defined:
The PCIS distributes data evenly among all partitions. Use round-robin partitioning to distribute rows evenly when the data does not need to be grouped among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.
In this scenario, Informatica recommends setting a partition point after the Source Qualifier with a partition type of round-robin. The PCIS distributes the data so that each partition processes approximately one third of the data.
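Conceptually, round-robin distribution behaves like the following sketch (illustrative Python, not PowerCenter internals; the row counts are hypothetical): rows are dealt to partitions in turn, so each partition ends up with roughly the same volume regardless of which source file the rows came from.

```python
# Illustrative sketch of round-robin distribution: rows are dealt to
# partitions in turn, so each partition receives roughly equal volume.
from itertools import chain

def round_robin(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

# Three source files of very different sizes (hypothetical row counts).
small, medium, large = range(100), range(1_000), range(10_000)
partitions = round_robin(chain(small, medium, large), 3)
print([len(p) for p in partitions])   # 3,700 rows in each partition
```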
The PCIS applies a hash function to a partition key to group data among partitions.
Use hash partitioning to ensure that the PCIS processes groups of rows with the same partition key in the same partition. For example, consider a scenario where items are sorted by item ID, but the number of items with a particular ID is not known. When the hash auto-keys option is used, the PCIS uses all grouped or sorted ports as the partition key. When the hash user keys option is selected, the ports that make up the partition key are specified manually.
An example of this type of partitioning is an Aggregator transformation, where groups of rows that share a key must be processed in the same partition so that the aggregation is computed correctly.
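The grouping behavior can be pictured with a small sketch (illustrative Python, not the PCIS hash function; the item_id key and row data are hypothetical): hashing the partition key guarantees that every row with the same key lands in the same partition, which is exactly what an Aggregator needs to compute group totals correctly.

```python
# Illustrative sketch of hash partitioning: every row with the same key
# is routed to the same partition, so per-key aggregation stays correct.
from collections import defaultdict
from zlib import crc32

def hash_partition(rows, key, num_partitions):
    partitions = defaultdict(list)
    for row in rows:
        bucket = crc32(str(row[key]).encode()) % num_partitions
        partitions[bucket].append(row)
    return partitions

rows = [
    {"item_id": 101, "qty": 5},
    {"item_id": 202, "qty": 3},
    {"item_id": 101, "qty": 7},   # same item_id -> same partition as the first row
]
for bucket, part in hash_partition(rows, "item_id", 4).items():
    print(bucket, part)
```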
Specify one or more ports to form a compound partition key for a source or target. The PCIS then passes data to each partition depending on the ranges you specified for each port.
Use key range partitioning when the sources or targets in the pipeline are partitioned by key range. Refer to the Workflow Administration Guide for further directions on setting up key range partitions.
For example, with key range partitioning and an end range of 2020, the PCIS passes to that partition data where values are less than 2020. Similarly, with a start range of 2020, the PCIS passes to that partition data where values are greater than or equal to 2020. Null values, or values that do not fall into any range, are passed to the first partition.
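The routing rule above can be sketched as follows (illustrative Python; the YEAR port name is hypothetical, and the 2020 boundary is taken from the example): values below the end range go to the first partition, values at or above the start range go to the second, and NULLs or unmatched values fall back to the first partition.

```python
# Illustrative sketch of key range partitioning on a single port (YEAR).
# Partition 0: values < 2020; partition 1: values >= 2020;
# NULLs or values outside every range fall back to the first partition.
def key_range_partition(rows, key):
    partitions = [[], []]
    for row in rows:
        value = row.get(key)
        if value is not None and value < 2020:
            partitions[0].append(row)
        elif value is not None and value >= 2020:
            partitions[1].append(row)
        else:                      # NULL or otherwise unmatched
            partitions[0].append(row)
    return partitions

rows = [{"YEAR": 2019}, {"YEAR": 2020}, {"YEAR": 2023}, {"YEAR": None}]
low, high = key_range_partition(rows, "YEAR")
print(len(low), len(high))         # 2 rows in the low partition, 2 in the high one
```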
In this type of partitioning, the PCIS passes all rows at one partition point to the next partition point without redistributing them.
Use pass-through partitioning when an additional pipeline stage is needed to improve performance, but the distribution of data across partitions cannot or should not be changed. The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and writer) by default. Each of these threads can process at most one data set at a time, so at most three data sets are processed simultaneously. If there are complex transformations in the mapping, the transformation thread may take longer than the other threads, which can slow data throughput.
It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread.
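The effect of adding a partition point at an expensive transformation can be pictured with a toy pipeline (illustrative Python threads and queues, not the DTM itself; the stage functions are hypothetical): splitting the transformation work into two stages lets two threads overlap their work on different rows while the data stays in the same partition.

```python
# Toy pipeline sketch: a pass-through partition point splits the expensive
# transformation into two stages, so two threads overlap their work on
# different rows without redistributing data across partitions.
import threading, queue

DONE = object()

def stage(work, inbox, outbox):
    while True:
        row = inbox.get()
        if row is DONE:
            outbox.put(DONE)
            break
        outbox.put(work(row))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
# Stage 1 and stage 2 each perform half of the original transformation logic.
t1 = threading.Thread(target=stage, args=(lambda r: r * 2, q1, q2))
t2 = threading.Thread(target=stage, args=(lambda r: r + 1, q2, q3))
t1.start(); t2.start()

for row in range(5):               # reader thread feeding the pipeline
    q1.put(row)
q1.put(DONE)

results = []
while (item := q3.get()) is not DONE:   # writer thread draining the pipeline
    results.append(item)
print(results)                      # [1, 3, 5, 7, 9]
```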
Once these factors have been considered and a partitioning strategy has been selected, the iterative process of adding partitions can begin. Continue adding partitions to the session until the desired performance threshold is reached, the gains from each additional partition diminish, or performance begins to degrade.
After adding a partition, monitor the session to see whether the partition is degrading or improving session performance. If session performance has improved, try adding another partition.
Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for large sessions. For small and medium-sized sessions, it is more practical to distribute whole sessions to different nodes using Workflow on Grid. Session on Grid leverages the existing partitions of a session by executing threads in multiple DTMs. The Log Service can be used to get the cumulative log.
With the dynamic partitioning option, the partition information can be configured so that the Integration Service determines the number of partitions to create at run time.
The Integration Service scales the number of session partitions at run time based on factors such as source database partitions or the number of nodes in a grid.
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. If the volume of data grows or more CPUs are added, partitioning may need to be adjusted so the session runs with optimal performance.
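The idea can be sketched as a simple run-time calculation (illustrative Python with hypothetical method names and inputs; the real Integration Service computes this internally based on the partitioning method selected): the number of partitions is derived from a single scaling factor such as the node count, the CPU count, or the number of source database partitions.

```python
# Illustrative sketch of dynamic partitioning: the partition count is
# derived at run time from a single scaling factor rather than hard-coded.
import os

def dynamic_partition_count(method, grid_nodes=1, db_partitions=1, explicit=1):
    # Hypothetical methods mirroring the idea of scaling on grid nodes, CPUs,
    # source database partitions, or an explicit parameter value.
    if method == "based_on_nodes":
        return grid_nodes
    if method == "based_on_cpus":
        return os.cpu_count() or 1
    if method == "based_on_source_partitioning":
        return db_partitions
    return explicit                    # value supplied via a parameter

print(dynamic_partition_count("based_on_nodes", grid_nodes=4))                    # 4
print(dynamic_partition_count("based_on_source_partitioning", db_partitions=8))   # 8
```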
With the Session on Grid option, more partitions can be added when more resources are available. Also, the number of partitions in a session can be tied to partitions in the database to reduce the maintenance of PowerCenter partitions and take advantage of database partitioning.