Increasingly, customers find that their data integration implementation must be available 24x7 without interruption or failure. This Best Practice describes the High Availability (HA) capabilities incorporated in PowerCenter and explains why it is critical to address both architectural (e.g., systems, hardware, firmware) and procedural (e.g., application design, code implementation, session/workflow features) recovery with HA.
One of the common requirements of high volume data environments with non-stop operations is to minimize the risk exposure from system failures. PowerCenter’s High Availability Option provides failover, recovery and resilience for business critical, always-on data integration processes. When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:
External resilience concerns the integration and specification of domain name servers, database servers, FTP servers, and network access servers in a defined, tested 24x7 configuration. The nature of Informatica's data integration role places PowerCenter at many interface points in the overall system. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:
In an HA PowerCenter environment, the key elements to keep in mind are:
Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal resilience can be configured at the following levels:
Restart and failover apply to the domain's application services (Integration Service and Repository Service). If these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP), and artifacts of the ETL process cannot be highly available.
If a service process becomes unavailable, the Service Manager can restart the process or fail it over to a backup node based on the availability of the node. When a service process restarts or fails over, the service restores the state of operation and begins recovery from the point of interruption.
Backup nodes can be configured for services with the high availability option. If an application service is configured to run on primary and backup nodes, one service process can run at a time. The following situations describe restart and failover for an application service:
If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. The service process can be disabled on the backup node to cause it to fail back to the primary node.
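If fail-back to the primary node is required, the service process on the backup node can be disabled from the command line as well as from the Admin Console. The following is a minimal sketch, assuming infacmd is on the path; the domain, user, service, and node names (Domain_Dev, admin, IS_Dev, node02) are placeholders, and the option names may vary slightly between PowerCenter versions:

    # Disable the Integration Service process on the backup node (node02),
    # causing the service to fail back to its primary node.
    infacmd DisableServiceProcess -dn Domain_Dev -un admin -pd <password> -sn IS_Dev -nn node02

    # Re-enable the process on the backup node afterwards so that it can
    # again act as a failover target.
    infacmd EnableServiceProcess -dn Domain_Dev -un admin -pd <password> -sn IS_Dev -nn node02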
Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption. The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:
When designing a system that has HA recovery as a core component, be sure to include architectural and procedural recovery.
Architectural recovery for a PowerCenter domain involves the Service Manager, Repository Service and Integration Service restarting in a complete, sustainable and traceable manner. If the Service Manager and Repository Service recover, but the Integration Service cannot recover, the restart is not successful and has little value to a production environment. Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:
Procedural recovery is supported by many features of PowerCenter. Consider the following very simple mapping that might run in production for many ETL applications. Suppose the FTP server sending the ff_customer file is unreliable: the file is frequently missing, yet the processes that depend on it must always run, and the process is insert only. You do not want the succession of ETL that follows this small process to fail; it can load customer_stg with the current records only. The following setting in the Workflow Manager (Session, Properties) fits this need:
Since it is not critical that the ff_customer records load on every run, record the failure but continue the process. Now suppose the situation has changed: sessions are failing on the PowerCenter server due to target database timeouts, and the requirement is that the session must recover from this:
Resuming from last checkpoint restarts the process from its prior commit, allowing no loss of ETL work. To finish this second case, consider three basic items on the workflow side when the HA Option is implemented. An Integration Service in an HA environment can only recover those workflows marked with “Enable HA recovery”. For all critical workflows, this should be considered. For a mature set of ETL code running in QA or Production, the following workflow property may be considered:
This would automatically recover tasks from the point of failure in a workflow after an application or system-wide failure. Consider the use of this feature carefully, however: automated restart of critical ETL processes without human interaction can have far-reaching unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to objects different from the original intent. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature.
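Where automatic recovery is not enabled, an operator can still recover a failed workflow from its point of interruption using pmcmd. A minimal sketch, in which the service, domain, folder, and workflow names (IS_Dev, Domain_Dev, EDW, wf_load_customer_stg) are placeholders:

    # Recover the failed workflow from its last point of interruption.
    pmcmd recoverworkflow -sv IS_Dev -d Domain_Dev -u admin -p <password> -f EDW wf_load_customer_stg

    # Confirm the workflow status once recovery has completed.
    pmcmd getworkflowdetails -sv IS_Dev -d Domain_Dev -u admin -p <password> -f EDW wf_load_customer_stg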
In an HA environment, certain components of the domain can go offline while the domain stays up to execute ETL jobs. This is the time to use the "Suspend On Error" feature on the General tab of the workflow settings. The backup Integration Service would then pick up this workflow and resume processing based on the resume settings of the workflow:
A variety of HA features exist in PowerCenter. Specifically, they include:
First, proceed from the assumption that nodes have been provided such that a basic HA configuration of PowerCenter can take place. The HA solution should be built on a shared file system supported by Informatica Global Customer Support (GCS). Your first step should always be to implement and thoroughly exercise the shared file system. Now, let's address the options in order:
You must have the HA option on the license key for this to be available on install. Note that once the base PowerCenter install is configured, all nodes are available from Admin Console->Domain->Integration Services->Grid/Node Assignments. With the HA (Primary/Backup) install complete, Integration Services are then displayed with both a "P" (primary) and a "B" (backup) assignment, with the current operating node highlighted:
If a failure were to occur in this HA configuration, the domain would fail the Integration Service over to the backup node, and the "B" indicator would then be highlighted, showing that node as now running the Integration Service.
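The failover can also be confirmed from the command line rather than from the Admin Console alone. A sketch, assuming placeholder service and domain names (IS_Dev, Domain_Dev) and that the option names match your PowerCenter version:

    # Verify that the Integration Service still responds after the failover.
    pmcmd pingservice -sv IS_Dev -d Domain_Dev

    # Query the domain gateway for the current service status.
    infacmd GetServiceStatus -dn Domain_Dev -un admin -pd <password> -sn IS_Dev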
A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent environment. The paths for Integration Service files must be specified for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.
Each Integration Service process uses run-time files to process workflows and sessions. If an Integration Service is configured to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.
State of operation files must be accessible by all Integration Service processes. When an Integration Service is enabled, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption.
All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location. By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. The shared location for these directories can be set by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern is that $PMRootDir must reside on the highly available clustered file system mentioned above.
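A simple way to exercise this is to confirm, from every node that can run an Integration Service process, that the shared $PMRootDir path resolves to the same clustered file system and is writable. A minimal sketch, assuming /infa_shared is the shared mount point and node01/node02 are placeholder node hosts:

    # From a management host, check the mount and write a test file from each node.
    for node in node01 node02; do
        ssh $node "df -h /infa_shared && touch /infa_shared/ha_write_test_\$(hostname)"
    done

    # On a correctly shared file system, both test files are visible from either node.
    ssh node01 "ls -l /infa_shared/ha_write_test_*"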
The Grid Option provides implicit HA because the Integration Service can be configured as active/active to provide redundancy. The Server Grid option must be included on the license key for this to be available upon install. In configuring the $PMRootDir files for the Integration Service, use the method described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration, the log and directory paths should be on the clustered file system mentioned above. A grid must be created before it can be used in a PowerCenter domain. Be sure to remember these key points:
If you have a large volume of disparate hardware, it is certainly possible to create, for example, two grids centered on two different operating systems. In either case, the performance of your clustered file system affects the performance of your server grid and should be considered as part of your performance/maintenance strategy.
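Because grid throughput is ultimately bounded by the clustered file system, it is worth baselining its write performance as part of that strategy. A rough sketch using dd, assuming /infa_shared is the shared mount; the 1 GB size is arbitrary, and oflag=direct (GNU dd) can be omitted if your platform's dd does not support it:

    # Write a 1 GB test file to the shared file system and report the throughput,
    # bypassing the page cache so the figure reflects the file system itself.
    dd if=/dev/zero of=/infa_shared/perf_test.tmp bs=1M count=1024 oflag=direct
    rm -f /infa_shared/perf_test.tmp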
You must have the HA option on the license key for this. To configure the Repository Service for HA, implement HA incrementally from a tested base configuration. After ensuring that the initial Repository Service settings (e.g., resilience timeout, code page, connection timeout) and the DBMS repository containing the metadata are running and stable, make the second node the Repository Service backup. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:
Click “OK” and the Repository Service is now configured in a Primary/Backup setup for the domain. To ensure the P/B setting, test the following elements of the configuration:
Note: Remember that when a node is taken offline, you cannot access Admin Console from that node.
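One way to exercise the Primary/Backup repository configuration is to ping the Repository Service, disable its service process on the primary node, and confirm that the backup node picks it up. A sketch with placeholder names (Domain_Dev, REP_Dev, node01); the infacmd option names may differ slightly by version:

    # Confirm the Repository Service is reachable before the test.
    infacmd ping -dn Domain_Dev -sn REP_Dev

    # Simulate loss of the primary node's service process.
    infacmd DisableServiceProcess -dn Domain_Dev -un admin -pd <password> -sn REP_Dev -nn node01

    # The service should fail over to the backup node; verify it is still reachable.
    infacmd ping -dn Domain_Dev -sn REP_Dev
    infacmd GetServiceStatus -dn Domain_Dev -un admin -pd <password> -sn REP_Dev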
A script can be used with the High Availability Option to check the domain itself as well as all of the Informatica services in the domain. If any of the services are down, the script can bring them back up. To implement this, the domain, Repository Service, and Integration Service details need to be provided as input to the script, and the script needs to be scheduled to run at regular intervals. The script can be developed with eight functions (plus one main function to check and bring up the services).
The script can be implemented in any environment by providing input in the <Input Environment Variables> section only. Comments have been provided for each function to make them easy to understand. Below is a brief description of the eight functions; a minimal sketch of the script follows the list:
print_msg: Called to print output to the console and to write to the log file.
domain_service_lst: Accepts the list of services to be checked in the domain.
check_service: Calls the service manager, repository, and integration functions internally to check whether they are up and running.
check_repo_service: Checks whether the Repository Service is up or down. If it is down, it calls another function to bring it up.
enable_repo_service: Called to enable the Repository Service.
check_int_service: Checks whether the Integration Service is up or down. If it is down, it calls another function to bring it up.
enable_int_service: Called to enable the Integration Service.
disable_int_service: Called to disable the Integration Service.
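The following is a minimal sketch of such a script, showing how a subset of these functions might hang together. It assumes infacmd is on the path; the domain, user, and service names are placeholders, and the exact infacmd command and option names should be verified against your PowerCenter version:

    #!/bin/sh
    # ---- <Input Environment Variables> ----
    DOMAIN=Domain_Dev                  # domain name
    DOMAIN_USER=admin                  # domain administrator account
    DOMAIN_PWD=changeit                # placeholder; better sourced from a secured location
    REPO_SVC=REP_Dev                   # Repository Service name
    INT_SVC=IS_Dev                     # Integration Service name
    LOG=/infa_shared/logs/ha_monitor.log

    print_msg() {
        # Write a message to standard output and append it to the log file.
        echo "`date` $1" | tee -a $LOG
    }

    check_repo_service() {
        # Ping the Repository Service; if it does not respond, bring it up.
        infacmd ping -dn $DOMAIN -sn $REPO_SVC > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            print_msg "Repository Service $REPO_SVC is down - enabling"
            enable_repo_service
        fi
    }

    enable_repo_service() {
        # Bring the Repository Service back up.
        infacmd EnableService -dn $DOMAIN -un $DOMAIN_USER -pd $DOMAIN_PWD -sn $REPO_SVC
    }

    check_int_service() {
        # Ping the Integration Service; if it does not respond, bring it up.
        infacmd ping -dn $DOMAIN -sn $INT_SVC > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            print_msg "Integration Service $INT_SVC is down - enabling"
            enable_int_service
        fi
    }

    enable_int_service() {
        # Bring the Integration Service back up.
        infacmd EnableService -dn $DOMAIN -un $DOMAIN_USER -pd $DOMAIN_PWD -sn $INT_SVC
    }

    # Main: check each service in turn. Schedule this script (e.g., via cron)
    # to run at a regular interval.
    check_repo_service
    check_int_service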