Informatica PowerCenter comes with a built-in feature that permits the use of user-defined SQL queries through the ‘SQL Query’ and ‘Lookup Override’ options available within the Source Qualifier and Lookup transformations, respectively. This feature is useful in some scenarios. However, pushing all business logic into SQL (such as data transformations, sub-queries, or CASE statements) is not always the best way to leverage PowerCenter’s capabilities. SQL overrides hide the traceability of business rules, create maintenance complexity, constrain the ability to tune PowerCenter mappings for performance (since all the work is done at the underlying database level), are rarely portable between different DBMS, and rule out source systems other than a relational DBMS.
This Best Practice document provides general guidelines for Informatica PowerCenter users on how SQL overrides can in many cases be avoided without compromising performance.
There are quite a few typical use cases for SQL Overrides. While it is not always possible to avoid them completely, in many cases the use of Source Filters or SQL Overrides provides no real benefit, particularly in terms of performance and maintainability. In these cases, it is advisable to look for alternative implementations.
This document describes situations where SQL Overrides are typically leveraged, but where it makes sense to at least try alternative approaches for implementation.
There are four common situations where SQL Overrides or Source Filters are typically used: calculating aggregated values per group of records and re-joining them with the detail data, selecting data based on flag values held in a separate control table, restricting a Lookup transformation to a complex subset of a large table, and retrieving hierarchical data from self-referencing structures. These use cases are analyzed and treated in more detail in the subsequent sections.
For a variety of reasons, SQL Overrides are commonly used throughout the Informatica PowerCenter world. Below are a few of the arguments commonly put forward for their widespread use:
It is true that PowerCenter does not build sub-SELECT statements on its own without further effort (this is primarily done by the add-on option named Pushdown Optimization). However, sub-queries always put an additional burden on the DBMS machine, and not every DBMS copes equally well with moderately or highly complex queries. It is almost always advisable to try other approaches first.
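As a hedged illustration (the table and column names are hypothetical), a correlated sub-SELECT such as the following is logically evaluated once per outer row unless the optimizer rewrites it, shifting the entire workload onto the DBMS server:

    -- Hypothetical example: pick the most recent order per customer.
    -- The inner query is correlated to the outer row, so the DBMS has to
    -- resolve it for every record of orders o1 (or rewrite it internally).
    SELECT o1.customer_id,
           o1.order_id,
           o1.order_date
    FROM   orders o1
    WHERE  o1.order_date = (SELECT MAX(o2.order_date)
                            FROM   orders o2
                            WHERE  o2.customer_id = o1.customer_id);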
It is often assumed that DBMS servers are equipped with more and faster CPUs, more memory, and faster hard disks than the PowerCenter servers connecting to these databases. This was fairly common when PowerCenter was a 32-bit application and many DBMS were available as 64-bit applications. However, this assumption is no longer valid in many cases (not only because PowerCenter is no longer available as a 32-bit application on UNIX platforms). Informatica highly recommends asking the customer how these machines are equipped before making any assumptions about which task runs faster on which machine.
Assume data is logically “grouped” by an ID. A value needs to be calculated for every “group” of records and finally this aggregated value has to be assigned to every record of the same “group” (or some details from the record having this aggregated value have to be assigned to all other records of the group).
One example would be a company with many manufacturing subsidiaries all over the world where all staffing costs and gross revenues per subsidiary and department are calculated. Then, for every single department of every subsidiary, the (positive or negative) relative difference to the average of all subsidiaries is retrieved.
In classic SQL-based applications, a sub-SELECT statement would derive, from the detail records, the one record per group that holds the aggregated value. This “leading” record per group would then be re-joined with the original data.
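For the subsidiary example above, such a SQL override might look roughly as follows (a hedged sketch; the table dept_costs and all column names are hypothetical):

    -- Hypothetical detail table: dept_costs(subsidiary_id, department_id, staffing_cost)
    -- The sub-SELECT aggregates the average staffing cost per department across
    -- all subsidiaries; the outer query re-joins this aggregate with the detail
    -- records so that every record carries the group average and its relative
    -- difference from it.
    SELECT d.subsidiary_id,
           d.department_id,
           d.staffing_cost,
           g.avg_cost,
           (d.staffing_cost - g.avg_cost) / g.avg_cost AS rel_diff
    FROM   dept_costs d
    JOIN   (SELECT department_id,
                   AVG(staffing_cost) AS avg_cost
            FROM   dept_costs
            GROUP BY department_id) g
           ON g.department_id = d.department_id;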
In PowerCenter, the aggregation step can in many cases be implemented using an Aggregator transformation with Sorted Input, minimizing cache files and maximizing processing speed. Because the data is still sorted by the group ID(s) at this point, the subsequent re-join can be executed using a Joiner transformation with Sorted Input, again minimizing cache sizes and maximizing processing speed.
The whole process can be shown using the following diagrams. Diagram 1 shows the implementation if the data needs to be sorted according to its “natural” ID after the aggregated values have been retrieved. Diagram 2 shows the implementation if the data needs to be sorted by the group ID after the aggregated values have been retrieved.
1- Aggregating Data Sorted by Natural ID
The steps in the above diagram are:
Step 1 - Sorting by Group ID: This ensures that the aggregation as well as the self-join can leverage the advantages of sorted input, meaning that both actions will only have to cache records for one single group ID instead of all data.
Step 2 - Aggregation: The required value is calculated per group (e.g., the maximum value of some attribute is extracted, or some values are summed up per group of data records).
Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the one record per group which holds the aggregated value, maximum number, or whatever aggregation is needed here. The self-join takes place after the aggregated values have been sorted according to the “natural” ID of the source data.
Step 4 - Re-join with the Original Data: The records holding aggregated values are now re-joined with the original data. This way every record now bears the aggregated values along with its own attributes.
2- Aggregating Data Sorted by Group ID
The steps in the above diagram are:
Step 1 - Sorting by Group ID: assumed to have already been done.
Step 2 - Aggregation: The required value is calculated per group (e.g., the maximum value of some attribute is extracted, or some values are summed up per group of data records).
Step 3 - Self-join with the Aggregated Data: This step retrieves all attributes of the one record per group which holds the aggregated value, maximum number, or whatever aggregation is needed here. Because the data is to remain sorted by the group ID, the self-join takes place directly on the group-sorted data, without re-sorting by the “natural” ID.
Step 4 - Re-join with the Original Data: The records holding aggregated values are now re-joined with the original data. This way every record now bears the aggregated values along with its own attributes.
In order to further minimize cache sizes for the session executing this example mapping, one might set up one “transaction” per “group” ID (in the sample case, customer ID and month) using a Transaction Control transformation (TCT). Based on the current values of the “group” ID, an Expression transformation can deliver a flag to this TCT indicating whether the current transaction (i.e., the current group of records) is continued or whether a new “group” has begun. Setting all Sorters, Joiners, Aggregators and so on to a Transformation Scope of “Transaction” allows the Integration Service to build caches just large enough to accommodate one single group of records. This can reduce the sizes of the cache files to be built by (in extreme cases) more than 99%.
Another fairly common use case for SQL Overrides is the selection of data based on a subset of values available in a lookup table or file. A typical case is a data table named A containing some kind of status information. This status information is itself described in a table named B which contains several flags. Only those records of A whose status carries the relevant flag values in B shall be used in the load process.
A SQL Override would join these tables via a SELECT statement combining records from table A with selected records from the controlling table B according to a complex condition.
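A hedged sketch of such an override (all table, column, and flag names are purely illustrative) could look like this:

    -- Hypothetical override: only records of A whose status carries the
    -- relevant flags in control table B are selected for the load.
    SELECT a.*
    FROM   data_table_a a
    JOIN   status_table_b b
           ON b.status_code = a.status_code
    WHERE  b.active_flag = 'Y'
      AND  b.relevant_for_load = 'Y';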
This approach has two big disadvantages: first, both entities have to be tables in a relational DBMS; second, both entities have to exist within the same DBMS instance (or have to be addressable as if both were physically present within the same database).
There is one general approach to this requirement: read table A via a normal Source Qualifier and retrieve the flag information from entity B using a Lookup transformation (or a Joiner transformation if B is read as a second source); a subsequent Filter transformation then discards all records whose flags do not qualify for the load. This works regardless of whether A and B are relational tables and regardless of whether they reside in the same database instance.
Another common use case for SQL Overrides is the selection of data for a Lookup transformation with some complex logic. For example, the base table for the Lookup contains 150 million records out of which only 200,000 records are needed for the lookup logic.
For this case there are two different approaches.
If the lookup logic needs data from one relational source table only, this feature is readily available in PowerCenter. Within the properties of a Lookup transformation, not only is the table from which to take lookup records named, but a Source Filter condition can also be entered; the Integration Service appends this condition to the automatically created SELECT statement, allowing for quite complex filter logic.
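Conceptually, the resulting lookup query behaves roughly like the sketch below (the exact SQL generated by the Integration Service depends on the ports and condition columns; the table and filter shown here are hypothetical):

    -- Approximation of a lookup SELECT with an appended Source Filter:
    -- only the roughly 200,000 relevant rows are read and cached instead
    -- of all 150 million rows of the base table.
    SELECT cust.customer_id,
           cust.customer_name,
           cust.segment
    FROM   customer_dim cust
    WHERE  cust.active_flag = 'Y'
      AND  cust.segment IN ('RETAIL', 'CORPORATE');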
An alternative is to associate the Lookup transformation with a relational Source Qualifier utilizing its standard features Source Filter and User-Defined Join. This is particularly useful if the data for the Lookup transformation originates from more than one source table.
If the lookup logic needs data from more than one source database or from sources other than relational tables, then the logic can be rebuilt as part of a normal PowerCenter mapping. Source Qualifier transformations with all their features, Joiner, Aggregator, and Filter transformations allow very complex transformation logic.
Finally, the data from the lookup entities can be combined with the main data to be processed using a Joiner transformation.
If both data streams can deliver their data sorted by a business key, then this final Joiner can be set up to leverage Sorted Input, allowing minimized cache files and maximized processing speed.
Imagine a customer needs the complete management chain for every employee in the entire organization. This means asking for the straight line from the respective member of the Board of Directors down to every employee, listing every manager on the intermediate levels.
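A SQL override would typically tackle this requirement with a recursive or hierarchical query against a self-referencing employee table. The following is a hedged sketch only: the table employees(emp_id, emp_name, manager_id) is hypothetical, and the exact recursive syntax varies between DBMS (Oracle, for instance, offers CONNECT BY instead of WITH RECURSIVE):

    -- Hypothetical self-referencing table: employees(emp_id, emp_name, manager_id)
    -- The recursive part extends each management chain by one level per iteration.
    -- Note: some DBMS require an explicit CAST for the growing chain column.
    WITH RECURSIVE mgmt_chain (emp_id, emp_name, chain) AS (
        SELECT emp_id, emp_name, emp_name
        FROM   employees
        WHERE  manager_id IS NULL          -- members of the Board have no manager
        UNION ALL
        SELECT e.emp_id,
               e.emp_name,
               c.chain || ' > ' || e.emp_name
        FROM   employees e
        JOIN   mgmt_chain c ON c.emp_id = e.manager_id
    )
    SELECT emp_id, emp_name, chain
    FROM   mgmt_chain;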
In order to retrieve the management hierarchy from such a self-referencing storage entity, a single PowerCenter mapping could be utilized, but this mapping would require a Java Transformation (or a similarly working “black box”) to internally store, sort, and process the data and to output the resulting strings to a target system. Not every organization would want to maintain such a Java Transformation. So, here is a more generic approach to retrieving hierarchies of data.
It is important to note that all of these steps can be implemented in PowerCenter, but not within one single workflow. It is mandatory to check whether the extract process has to be repeated; however, as workflows cannot restart themselves immediately, this check must be performed by another process (possibly a second workflow) which, whenever needed, restarts the extraction process. After the extraction process has finished and written its output, control is handed back to the process that checks whether another iteration is required.
As the check process and the extraction process cannot be implemented within the same PowerCenter workflow, two workflows invoking each other work fine.
Even highly complex business requirements (such as processing data stored in hierarchical data structures or self-referencing relational structures) can be handled by modern versatile ETL tools using their standard technology.
Sometimes auxiliary measures are helpful (e.g., short Perl or shell scripts, embedded Java code, etc.). When used with caution, such little helpers greatly increase the usefulness and flexibility of the ETL processes while keeping the focus on scalability, transparency, ease of maintenance, portability, and performance.
For a variety of reasons, many ETL developers resort to complex SQL statements in order to implement moderately or highly complex business logic. Very often these reasons include a better working knowledge of the ODS DBMS than of the ETL tools, the need for special functionality provided by a DBMS, or past experience with DBMS servers yielding better performance than ETL processes.
While there are special cases in which such SQL statements do make sense, they should be used as a last resort if all other measures fail. They are not scalable; hide transformation and business logic; increase maintenance efforts; are usually not portable between different DBMS; and require special knowledge and experience with the respective DBMS.
Several sample use cases have shown a few standard approaches for avoiding SQL overrides, or at least decreasing the need for them. Even highly complex logic can usually be replaced by ETL processes. In addition, good ETL tools provide users with various features to extend the standard functionality on all levels of process implementation without compromising scalability, performance, or portability.