Last Updated: May 25, 2021

Challenge

Informatica PowerCenter comes with a built-in feature that permits the use of user-defined SQL queries through the ‘SQL Query’ and ‘Lookup Override’ options available within the Source Qualifier and Lookup transformations, respectively. This feature is useful in some scenarios. However, pushing all business logic into SQL (such as data transformations, sub-queries, or case statements) is not always the best way to leverage PowerCenter’s capabilities. SQL overrides hide the traceability of business rules, create maintenance complexity, constrain the ability to tune PowerCenter mappings for performance (since all the work is being done at the underlying database level), are rarely portable among different DBMS, and constrain the ability to work with source systems other than a relational DBMS.
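
As an illustration, a hypothetical SQL Override entered into a Source Qualifier might look like the following minimal sketch (the ORDERS and SHIPMENTS tables and all column names are assumed for illustration only and do not come from this document):

    SELECT o.ORDER_ID,
           o.CUSTOMER_ID,
           CASE WHEN o.ORDER_AMOUNT > 10000 THEN 'LARGE' ELSE 'STANDARD' END AS ORDER_CLASS,
           (SELECT MAX(s.SHIP_DATE)
              FROM SHIPMENTS s
             WHERE s.ORDER_ID = o.ORDER_ID) AS LAST_SHIP_DATE
      FROM ORDERS o
     WHERE o.ORDER_STATUS = 'OPEN'

In such a query, the transformation rule (the CASE expression) and the sub-query are hidden inside the database statement instead of being visible in the mapping, which is exactly the traceability and maintenance problem described above.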

This Best Practice document provides general guidelines for Informatica PowerCenter users on how SQL overrides can in many cases be avoided without compromising performance.

Description

There are quite a few typical use cases for SQL Overrides. While it is not always possible to avoid SQL Overrides completely, there are many cases where the use of Source Filters or SQL Overrides does not provide a real benefit (in particular in terms of performance and maintainability). In these cases, it is advisable to look for alternative implementations.

This document describes situations where SQL Overrides are typically leveraged, but where it makes sense to at least try alternative approaches for implementation.

Below are four common situations where SQL Overrides or Source Filters are used. This list briefly describes these use cases, which are analyzed in more detail in the subsequent sections.

  • Self-Join: Two typical cases can be distinguished here, but both have one thing in common: they reference the source data to retrieve some aggregated value, which is then associated with all original data records.
  • Subset Inclusion: The SQL Override contains one (innermost) sub-SELECT returning a small subset of data from one table or a set of tables; then every following SELECT refers to this in order to join the subset with some other table(s).
  • Complex Lookup Logic: A Lookup transformation with shared cache is used several times within a mapping and the lookup query contains some complex logic.
  • Recursively Stored Data (for example, in Oracle often extracted via CONNECT BY PRIOR).

Common Arguments

For a variety of reasons, SQL Overrides are commonly used throughout the Informatica PowerCenter world. Below are a few of the arguments commonly given for their widespread use:

  • In the main SELECT clause, some reference to an inner sub-SELECT is needed, and in PowerCenter it is not possible (or only possible with some challenges) to build such an inner SELECT and reference it from the main SELECT statement, so an SQL Override cannot be avoided.
  • DBMS machines are always much more powerful than PowerCenter machines.
  • It is more efficient to filter records at the source than to feed unnecessary records into a mapping.

It is true that PowerCenter does not build sub-SELECT statements on its own without further effort (this is primarily the job of the add-on option named Pushdown Optimization). However, sub-queries always put an additional burden on the DBMS machine, and not every DBMS can cope equally well with moderately or highly complex queries. It is almost always advisable to try other approaches first.

It is often assumed that DBMS servers are equipped with more and faster CPUs, more memory, and faster hard disks than the PowerCenter servers connecting to these databases. This was fairly common when PowerCenter was a 32-bit application and many DBMS were available as 64-bit applications. However, this assumption is no longer valid in many cases (not only because PowerCenter is no longer available as a 32-bit application on UNIX platforms). Informatica highly recommends asking the customer how these machines are equipped before making any assumptions about which task runs faster on which machine.

Case 1 - Self-Join

SQL Overrides Returning an Aggregated Value for Each Group of Records

Assume data is logically “grouped” by an ID. A value needs to be calculated for every “group” of records and finally this aggregated value has to be assigned to every record of the same “group” (or some details from the record having this aggregated value have to be assigned to all other records of the group).

One example would be a company with many manufacturing subsidiaries all over the world where all staffing costs and gross revenues per subsidiary and department are calculated. Then, for every single department of every subsidiary, the (positive or negative) relative difference to the average of all subsidiaries is retrieved.

In classic SQL-based applications, sub-SELECT statements would gather the detail records for the one record per group holding the aggregated value. Then this “leading” record per group would be re-joined with the original data.
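
Expressed in SQL, such a self-join is typically written roughly as follows. This is only a sketch; the SUBSIDIARY_COSTS table and its columns are assumed for illustration and mirror the subsidiary/department example above:

    SELECT d.SUBSIDIARY_ID,
           d.DEPARTMENT_ID,
           d.STAFF_COST,
           a.AVG_STAFF_COST,
           (d.STAFF_COST - a.AVG_STAFF_COST) / a.AVG_STAFF_COST AS REL_DIFF
      FROM SUBSIDIARY_COSTS d
      JOIN (SELECT DEPARTMENT_ID,
                   AVG(STAFF_COST) AS AVG_STAFF_COST
              FROM SUBSIDIARY_COSTS
             GROUP BY DEPARTMENT_ID) a
        ON a.DEPARTMENT_ID = d.DEPARTMENT_ID

The remainder of this case shows how the same result can be produced inside a PowerCenter mapping without pushing this query down to the database.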

In many cases, the aggregation can be implemented leveraging an Aggregator with Sorted Input, minimizing cache files and maximizing processing speed. Because the data is still sorted by the group ID(s), the subsequent self-join can be executed using a Joiner with Sorted Input, keeping cache sizes equally small.

The whole process can be shown using the following diagrams. Diagram 1 shows the implementation if the data needs to be sorted according to its “natural” ID after the aggregated values have been retrieved. Diagram 2 shows the principle of the implementation if the data needs to be sorted by the group ID after the aggregated values have been retrieved.

Diagram 1 - Aggregating Data Sorted by Natural ID

The steps in the above diagram are:

Step 1 - Sorting by Group ID: This ensures that the aggregation as well as the self-join can leverage the advantages of sorted input, meaning that both actions will only have to cache records for one single group ID instead of all data.

Step 2 - Aggregation: (e.g., the maximum value of some attribute per group can be extracted here, or some values can be summed up per group of data records).

Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the one record per group which holds the aggregated value, maximum number, or whatever aggregation is needed here. The self-join takes place after the aggregated values have been sorted according to the “natural” ID of the source data.

Step 4 - Re-join with the Original Data: The records holding aggregated values are now re-joined with the original data. This way every record now bears the aggregated values along with its own attributes.

Diagram 2 - Aggregating Data Sorted by Group ID

The steps in the above diagram are:

Step 1 - Sorting by group ID: assumed as already done.

Step 2 - Aggregation: (e.g., the maximum value of some attribute per group can be extracted here, or some values can be summed up per group of data records).

Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the one record per group which holds the aggregated value, maximum number, or whatever aggregation is needed here. In this variant the data is still sorted by the group ID when the self-join takes place, so no additional sort by the “natural” ID is required.

Step 4 - Re-join with the Original Data: The records holding aggregated values are now re-joined with the original data. This way every record now bears the aggregated values along with its own attributes.

In order to further minimize cache sizes for the session executing this example mapping, one might set up one “transaction” per “group” ID (in the sample case, customer ID and month) using a Transaction Control transformation (TCT). Based on the current values of the “group” ID, an Expression transformation can deliver a flag to this TCT indicating whether the current transaction (i.e., the current group of records) is continued or whether a new “group” has begun. Setting all Sorters, Joiners, Aggregators, and so on to a Transformation Scope of “Transaction” allows the Integration Service to build caches just large enough to accommodate one single group of records. This can reduce the sizes of the cache files to be built by (in extreme cases) more than 99%.

Case 2 - Subset Inclusion

Retrieving Data Based on a Subset of Values

Another fairly common use case for SQL Overrides is the selection of data based on a subset of values available in a lookup table or file. A typical case is a data table named A containing some sort of status information. This status information is itself stored in a table named B, which contains several flags. Only those records from table A whose status matches specific flag values in table B shall be used in the load process.

An SQL Override would join these tables via a SELECT statement that combines records from table A with selected records from the controlling table B according to a complex condition.
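
Such an override might look roughly like the following sketch (DATA_TABLE_A, CONTROL_TABLE_B, and the column names are assumed for illustration only):

    SELECT a.*
      FROM DATA_TABLE_A a
      JOIN CONTROL_TABLE_B b
        ON b.STATUS_ID = a.STATUS_ID
     WHERE b.ACTIVE_FLAG = 'Y'
       AND b.CTRL_FLAG IN ('A', 'C', 'T', 'X')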

This approach has two big disadvantages: first, both entities have to be tables in a relational DBMS; second, both entities have to exist within the same DBMS instance (or have to be addressed as if both were physically present within the same database).

There is one general approach to this requirement:

  1. Source the controlling data (entity B) and retrieve those “flag” values which need to be used as a filter condition when sourcing the actual data.
  2. Construct a suitable filter condition out of these records (e.g., a WHERE clause like this:

    CTRL_FLAG IN ('A', 'C', 'T', 'X')).

    If entity A is not a relational table and hence no Source Filter can be used in the Source Qualifier, construct a Filter condition in the PowerCenter transformation language like this:

    IN(CTRL_FLAG, 'A', 'C', 'T', 'X')

Case 3 - Complex Lookup Logic

Another common use case for SQL Overrides is the selection of data for a Lookup transformation with some complex logic. For example, the base table for the Lookup contains 150 million records out of which only 200,000 records are needed for the lookup logic.

For this case there are two different approaches.

If the lookup logic needs data from one relational source table only, PowerCenter provides this capability directly: within the properties of a Lookup transformation, the table from which to take the lookup records is named, and in addition a Source Filter condition can be entered. The Integration Service appends this condition to the automatically created SELECT statement, allowing for quite complex filter logic.
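
For example, a Source Filter condition for such a Lookup transformation could be as simple as the following WHERE-style fragment (CURRENT_FLAG and RECORD_STATUS are assumed column names, not taken from this document):

    CURRENT_FLAG = 'Y' AND RECORD_STATUS IN ('ACTIVE', 'PENDING')

With a condition like this, the Integration Service would cache only the comparatively small set of relevant records (in the example above, 200,000 rows) instead of the full 150 million.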

An alternative is to associate the Lookup transformation with a relational Source Qualifier, utilizing its standard Source Filter and User-Defined Join features. This is particularly useful if the data for the Lookup transformation originates from more than one source table.

If the lookup logic needs data from more than one source database or from sources other than relational tables, then the logic can be rebuilt as part of a normal PowerCenter mapping. Source Qualifier transformations with all their features, together with Joiner, Aggregator, and Filter transformations, allow for very complex transformation logic.

Finally, the data from the lookup entities can be combined with the main data to be processed using a Joiner transformation.

If both data streams can deliver their data sorted by a business key, then this final Joiner can be set up to leverage Sorted Input, allowing minimized cache files and maximized processing speed.

Case 4 - Recursively Stored Data

Imagine a customer needs the complete management chain for every employee in the entire organization. This means asking for the straight line from the respective member of the Board of Directors down to every employee, listing every manager on the intermediate levels.
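
When the employee data resides in a self-referencing relational table, this hierarchy is classically retrieved with a recursive query such as Oracle's CONNECT BY PRIOR mentioned earlier. The following sketch assumes a hypothetical EMPLOYEES table with EMP_ID, EMP_NAME, and MGR_ID columns:

    SELECT EMP_ID,
           EMP_NAME,
           SYS_CONNECT_BY_PATH(EMP_NAME, '/') AS HIERARCHY_PATH,
           LEVEL - 1                          AS LEVEL_IND
      FROM EMPLOYEES
     START WITH MGR_ID IS NULL
    CONNECT BY PRIOR EMP_ID = MGR_ID

This kind of query is DBMS-specific; the approach described below rebuilds the same logic with standard PowerCenter sessions.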

In order to retrieve the management hierarchy from this storage entity, one single PowerCenter mapping could be utilized, but this mapping would require a Java Transformation (or some similarly working “black box”) to internally store, sort, and process the data and to output the resulting strings to a target system. Not every organization would want to maintain such a Java Transformation. So, here is a more generic approach to retrieving hierarchies of data.

  1. An auxiliary table is initialized with all top-level managers. The name of each of these managers is saved as the “hierarchy path”, and the level indicator is set to 0.
  2. For the following session a parameter file is created with a mapping parameter $$LEVEL set to 0.
  3. The following session extracts all records from the source for employees who report directly to the top-level managers (i.e., whose manager is a top-level manager). Each of these employees is written to the auxiliary table with the complete hierarchy path (the manager's path extended by the employee's name) and a level indicator of $$LEVEL + 1, which here evaluates to 1 (a sketch of such an extraction step follows this list).
  4. Because at least one record at a lower level in the hierarchy has been found (the total count of lower-level records is > 0), the global level indicator is increased by 1 to a new value of 1.
  5. For the following session run the parameter file is re-created with the mapping parameter $$LEVEL set to 1.
  6. The session now extracts all records from the source for employees who report directly to those managers extracted with level indicator = 1.
  7. As the current session run has extracted more than zero records, the global level indicator is increased from 1 to 2; the session is run again and writes the next group of records to the auxiliary table with a level indicator of 3.
  8. This process is repeated until a run finds no further dependents for the last level; as no output records are written to the auxiliary table, the whole process terminates.
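
The following is a minimal sketch of the logic behind this loop, expressed in SQL only to make the per-iteration processing explicit; within PowerCenter the join would typically be built with a Source Qualifier, Joiner, or Lookup rather than a hand-written query. The EMPLOYEES source table (EMP_ID, EMP_NAME, MGR_ID) and the EMP_HIERARCHY auxiliary table (EMP_ID, HIERARCHY_PATH, LEVEL_IND) are assumed names:

    -- Step 1: initialize the auxiliary table with all top-level managers (level 0)
    INSERT INTO EMP_HIERARCHY (EMP_ID, HIERARCHY_PATH, LEVEL_IND)
    SELECT EMP_ID, EMP_NAME, 0
      FROM EMPLOYEES
     WHERE MGR_ID IS NULL;

    -- Steps 3 and 6: extract all direct reports of the employees currently at level $$LEVEL
    -- and append their names to the hierarchy path; they are written back at level $$LEVEL + 1
    SELECT e.EMP_ID,
           h.HIERARCHY_PATH || '/' || e.EMP_NAME AS HIERARCHY_PATH,
           $$LEVEL + 1                           AS LEVEL_IND
      FROM EMPLOYEES e
      JOIN EMP_HIERARCHY h
        ON h.EMP_ID = e.MGR_ID
     WHERE h.LEVEL_IND = $$LEVEL;

The $$LEVEL mapping parameter is replaced from the parameter file at session initialization, so each run of the extraction session processes exactly one level of the hierarchy.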

It is important to note that all these steps can be implemented in PowerCenter, but not within one single workflow. It is mandatory to check whether the extract process has to be repeated. However, as workflows cannot restart themselves immediately, this check must be performed by another process (possibly a second workflow) which, whenever needed, restarts the extraction process. After the extraction process has finished and written its output, control has to be handed back to the process that checks whether another iteration is required.

As the check process and the extraction process cannot be implemented within the same PowerCenter workflow, two workflows invoking each other work fine.

Summary of Use Cases

Even highly complex business requirements (such as processing data stored in hierarchical data structures or self-referencing relational structures) can be handled by modern versatile ETL tools using their standard technology.

Sometimes auxiliary measures are helpful (e.g., short Perl or shell scripts, embedded Java code, etc.). When used with caution, such little helpers greatly increase the usefulness and flexibility of the ETL processes while keeping the focus on scalability, transparency, ease of maintenance, portability, and performance.

Conclusion

For a variety of reasons, many ETL developers resort to complex SQL statements in order to implement moderately or highly complex business logic. Very often these reasons include better working knowledge of the ODS DBMS than of the ETL tools, the need for special functionality provided by a DBMS, or past experience with DBMS servers yielding better performance than ETL processes.

While there are special cases in which such SQL statements do make sense, they should be used as a last resort if all other measures fail. They are not scalable; hide transformation and business logic; increase maintenance efforts; are usually not portable between different DBMS; and require special knowledge and experience with the respective DBMS.

Several sample use cases have shown a few standard approaches for avoiding SQL overrides, or at least for decreasing the need for them. Even highly complex logic can usually be replaced by ETL processes. Also, good ETL tools provide users with various features to extend the standard functionality at all levels of process implementation without compromising scalability, performance, or portability.
