Cloud Data Warehouse Journey

With Informatica’s IPU model, all Informatica Cloud Data Management services are available for use and customers are only charged for what they use. This gives customers the flexibility to consistently use the right tool for the right job without licensing concerns. Some of the best practices for which services to use, and when, for Cloud Data Warehousing are identified below.

The principal consideration for a Cloud Data Warehousing use case is how to process the data. Informatica’s Optimization Engine is unique to the market in that it offers multiple modes of execution; however, this requires an understanding of what the engines are, how they work, and when to use them. The three primary engines are identified below, along with additional information about their intended usage.

Advanced Pushdown Optimization (APDO): Informatica’s ELT engine, APDO, can push the work down to the underlying technology, such as a Cloud Data Warehouse (CDW) or Cloud Data Lake. This allows customers to leverage their investment in these technologies while avoiding data transfer costs. APDO can be used when moving data from Cloud Object Storage (S3, ADLS, GCS) to a CDW with no hierarchical processing or advanced data integration transformation requirements. It is also the primary recommendation for processing data that has already been loaded into a CDW but needs additional transformation. In addition to the cost savings, APDO can offer performance benefits in excess of 20x compared to traditional Spark engines.

Cloud Data Integration – elastic (CDI-e): Informatica’s elastic engine can grow and shrink the number of servers used for distributed processing of typically larger data sets. It uses Artificial Intelligence and Machine Learning algorithms to determine the appropriate number of servers, and internal benchmark studies have shown Informatica CDI-e to be among the most cost-efficient processing engines available in a Cloud ecosystem.

It is most often used in Data Lake use cases where data is moved from Cloud Object Storage (Data Lake) to Cloud Object Storage (Data Lake) with transformations in between.
It can also be used to move data from Cloud Object Storage to a CDW when the source data is in a hierarchical format, the source data is partitioned, there is a need for high concurrency (over 20 concurrent jobs), or advanced data integration functionality is required for processing.
CDI-e can also be used for a pure Cloud Data Warehouse use case when data in a CDW needs to be processed. Because the data will be extracted from the CDW, potentially incurring data transfer costs, this is not the primary recommendation for CDW processing. It could still be used when high concurrency or advanced data integration tasks are required, or when the data volumes are large enough that the processing cost savings exceed any data transfer fees.
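The cost trade-off in the last point comes down to simple arithmetic. The sketch below is illustrative only: the function name, its parameters, and the figures in the usage example are hypothetical and do not reflect Informatica pricing or any Informatica API.

```python
def cdi_e_saves_money(in_warehouse_cost: float,
                      cdi_e_cost: float,
                      data_transfer_fee: float) -> bool:
    """Return True when extracting data to CDI-e is cheaper overall.

    All inputs are hypothetical cost estimates in the same currency:
    - in_warehouse_cost: cost of processing the data inside the CDW
    - cdi_e_cost: cost of processing the same data on CDI-e
    - data_transfer_fee: fee for moving the data out of the CDW
    """
    processing_savings = in_warehouse_cost - cdi_e_cost
    # CDI-e only pays off when the savings exceed the transfer fee.
    return processing_savings > data_transfer_fee


# Usage with made-up numbers: savings of 40 against a fee of 25.
print(cdi_e_saves_money(100.0, 60.0, 25.0))
```

In practice the inputs would come from an organization's own cost estimates for the workload in question.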

Cloud Data Integration (CDI): A traditional ETL engine, CDI will extract the data from a source system, process and transform it on an Informatica Secure Agent deployed to a server, and load it into a target. CDI is most often used for Ingestion use cases when the data is not already in a Cloud ecosystem. It can also be used to move data within a Cloud ecosystem when other engines are unavailable or inappropriate.
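The guidance above for the three engines can be summarized as a small decision function. This is a simplified sketch of the recommendations in this section, not an Informatica API: the Workload class, its field names, and the string labels are assumptions made for illustration, and the volume-based cost consideration for CDI-e is left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a data movement or processing job."""
    source: str                      # "external", "object_storage", or "cdw"
    target: str                      # "object_storage" or "cdw"
    hierarchical: bool = False       # source data is in a hierarchical format
    partitioned: bool = False        # source data is partitioned
    concurrent_jobs: int = 1         # expected number of concurrent jobs
    advanced_transformations: bool = False


def recommend_engine(w: Workload) -> str:
    """Return the engine this section recommends as a starting point."""
    high_concurrency = w.concurrent_jobs > 20

    # Data that is not yet in a Cloud ecosystem: traditional ETL ingestion.
    if w.source == "external":
        return "CDI"

    # Data Lake to Data Lake with transformations in between.
    if w.source == "object_storage" and w.target == "object_storage":
        return "CDI-e"

    # Object storage to CDW: pushdown unless CDI-e conditions apply.
    if w.source == "object_storage" and w.target == "cdw":
        if (w.hierarchical or w.partitioned or high_concurrency
                or w.advanced_transformations):
            return "CDI-e"
        return "APDO"

    # Data already in the CDW: pushdown is the primary recommendation.
    if w.source == "cdw" and w.target == "cdw":
        if high_concurrency or w.advanced_transformations:
            return "CDI-e"
        return "APDO"

    # Fallback when the other engines are unavailable or inappropriate.
    return "CDI"


print(recommend_engine(Workload(source="object_storage", target="cdw")))
```

A real selection would weigh additional factors such as data volume and connector support, so the result is best treated as a first guess rather than a rule.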

In addition to the above-listed engines, Informatica offers further services that are part of the best practices for the Cloud Data Warehouse use case.

Advanced Serverless: When deploying any of Informatica’s execution engines, it has traditionally been the customer’s responsibility to provide the hardware that Informatica will run on – even the pushdown functionality needs to be sent from a Secure Agent running on a server.

This provisioning of hardware includes, but is not limited to:

Selecting the hardware
Picking the Operating System
Defining the configuration options
Creating and decommissioning the hardware if not using CDI-e

Moreover, Informatica typically doesn’t recommend running more than 20 concurrent jobs from a single Secure Agent without adding more Secure Agents, which most often requires adding more hardware.
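As a rough planning aid, the guideline of 20 concurrent jobs per Secure Agent translates into a ceiling division. The function below is a hypothetical helper written for this illustration, not part of any Informatica tooling, and the default of 20 simply mirrors the guideline stated above.

```python
def secure_agents_needed(concurrent_jobs: int, max_jobs_per_agent: int = 20) -> int:
    """Estimate how many Secure Agents a given concurrency level calls for."""
    if concurrent_jobs < 0 or max_jobs_per_agent <= 0:
        raise ValueError("job counts must be non-negative and the limit positive")
    # Ceiling division using integers only.
    return (concurrent_jobs + max_jobs_per_agent - 1) // max_jobs_per_agent


for jobs in (20, 21, 45):
    print(jobs, "concurrent jobs ->", secure_agents_needed(jobs), "agent(s)")
```

Each additional agent typically means additional hardware to provision, which is the overhead the next paragraph addresses.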

A better solution is to run in Advanced Serverless mode. In this mode, all of the processing is done on hardware that Informatica provisions, but that hardware is part of a Security Group with a Tenant Controlled Trusted ENI connection that prohibits Informatica from accessing the environment or data. This approach leverages functionality provided by the Cloud ecosystems, is used by many of the ecosystem-native tools, and offers the ability to outsource hardware considerations to Informatica while maintaining the security and control that cloud customers are looking for. For more information on the seven layers of enterprise-grade security protection, see the Security Whitepaper.

Advanced Data Integration/Cloud Data Quality (CDQ): While Informatica’s Cloud Data Quality functionality can be used in Data Governance initiatives, the pre-built transformations are also invaluable assets for the types of advanced data integration functionality typically seen in CDW use cases. Specifically, this service can be used for the following:

Profiling: Similar to getting turn-by-turn directions instead of simply “heading west,” profiling the source data can help identify roadblocks that will cause a data integration pipeline to fail.
Standardization: If one business unit uses a rating system of gold, silver, and bronze and another uses one, two, or three, these values must be standardized for effective reporting.
Text Parsing: We see text data coming from social media posts, call center comments, and product descriptions, among other sources. To properly analyze this data, key data values must be extracted from the text.
Fuzzy Matching: Source data, particularly from outside an organization, does not always match a System of Record for a data pipeline’s lookup functionality – fuzzy matching can help address this concern.
Validation: Setting up validation checks can quickly and easily ensure that erroneous data values or records, especially duplicate records, are not loaded into a CDW, where they would hinder its effectiveness.
Enrichment: Many organizations are looking to do geo-spatial analysis – it’s no longer good enough to analyze who lives in New York City; organizations want to know who lives within one mile of a particular landmark. Informatica offers the ability to append enriched data like geo-codes to the data.
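To make a few of these tasks concrete, the sketch below shows what standardization, fuzzy matching, and a one-mile radius check might look like if coded by hand in plain Python. It is not how Cloud Data Quality implements them. The rating mapping (gold = 1, silver = 2, bronze = 3), the function names, the similarity cutoff, and the sample values are all assumptions made for illustration.

```python
import difflib
import math
from typing import List, Optional

# Assumed mapping: both rating systems are normalized to 1, 2, 3.
RATING_MAP = {"gold": 1, "silver": 2, "bronze": 3,
              "one": 1, "two": 2, "three": 3}

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def standardize_rating(raw: str) -> Optional[int]:
    """Map a rating from either business unit to a common numeric scale."""
    return RATING_MAP.get(raw.strip().lower())


def fuzzy_lookup(value: str, reference: List[str],
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the closest reference value, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def within_radius(lat1: float, lon1: float, lat2: float, lon2: float,
                  miles: float = 1.0) -> bool:
    """Check whether two geo-coded points lie within the given distance."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula for great-circle distance.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    distance = 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
    return distance <= miles


print(standardize_rating(" Gold "))
print(fuzzy_lookup("Acme Corporaton", ["Acme Corporation", "Apex Industries"]))
print(within_radius(40.7580, -73.9855, 40.7680, -73.9855))
```

Even this small sample shows how much hand-written logic, reference data, and tuning (such as the similarity cutoff) these tasks involve when coded manually.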

This functionality, used to address advanced data integration tasks that would otherwise need to be coded, is available through pre-built transformations as part of Informatica’s Cloud Data Quality service. Also included are hundreds of out-of-the-box accelerators that can expedite development and deployment. Finally, Cloud Data Quality relies on a series of dictionaries for much of the processing. Many of these are provided out-of-the-box; others can be generated through Data Profiling or built and maintained by the business analysts responsible for data ownership and stewardship.

Cloud Data Ingestion and Replication (CDIR): Informatica’s CDIR service is designed to streamline data ingestion and replication across distributed data environments, which makes it important for organizations that need to synchronize data across multiple platforms in real time. This service is especially valuable when enterprises have data dispersed across various sources, such as on-premises databases, cloud-based storage, or multi-cloud environments.

It supports different use cases: