Last Updated: Aug 18, 2022

Business Goal

With the explosion of the digital economy, data unlocks insights that drive greater profitability, uncover new opportunities, detect problems proactively, accelerate product and service innovation, and deliver exceptional customer experiences. Retailers use data to analyze customer behavior and personalize the shopping experience. Manufacturers use data to fine-tune their global supply chains. Healthcare providers leverage data to care for patients better and at lower cost. Utilities and telecom companies use data to inform predictive maintenance and avoid service outages.


To meet ever-changing and aggressive business demands and maintain competitive advantage, organizations need to invest in processes that let them run their historical, batch, and real-time data through models that can predict business outcomes and trigger preventive measures.


But gaining fast, easy access to data is getting harder. As organizations automate and digitize more of their processes and products, more data is generated. Massive amounts of data are locked in the many different applications and systems that span an organization’s IT infrastructure, in the cloud and in private data centers. As the volume grows, it becomes more difficult for organizations to find and use this data to gain insights and attain business value.


Organizations have traditionally used enterprise data warehouses (EDW) to analyze, report on, and store operational data, but the traditional EDW is limited in today’s real-time, data-driven world. An EDW is optimized to store structured and modeled relational data coming from transactional systems and line-of-business applications. It is not designed to deal with the volume and variety of both unstructured non-relational and structured data available in today’s organizations. This limitation has spurred the innovation and development of new technologies that are able to handle any type of data, of any size, and at great speed. The implementations of these new technologies in organizations are generally referred to as data lakes.


A data lake enables organizations to store all structured and unstructured data efficiently, cost-effectively, and at any scale. In addition to storing relational data from line-of-business and transactional applications, a data lake can store unstructured and real-time data from mobile apps, IoT devices, sensors, and social media.


Data lakes deliver new and innovative data analysis strategies such as schema on read, which is the capability of storing data without first defining its structure. By not forcing the data into a predetermined model, organizations retain more information that can be used to answer future questions with greater detail and more confidence. The data lake ecosystem offers a broad variety of tools to provide analytics capabilities, from dashboards and visualizations to SQL queries, big-data processing, full-text search, real-time analytics, and machine learning to guide data-driven decision-making or to take action.
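Schema on read can be illustrated with a small sketch (the field names and records here are hypothetical): raw JSON events are stored exactly as they arrive, and a schema is applied only when a question is asked, projecting just the fields that question needs.

```python
import json

# Raw events are stored as-is -- no schema is imposed at write time.
raw_events = [
    '{"user": "a1", "action": "view", "price": 9.99}',
    '{"user": "b2", "action": "click"}',  # fields may be missing
    '{"user": "c3", "action": "buy", "price": 4.5, "coupon": "X7"}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the fields the
    current question needs, tolerating records that lack them."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# One possible "schema" over the same raw data: users with a price.
purchases = [r for r in read_with_schema(raw_events, ["user", "price"])
             if r["price"] is not None]
```

A later question can project a different set of fields (say, `coupon` usage) over the same stored data, which is the point: nothing was discarded at write time.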


Data Lake storage architectures are at the heart of many data lake implementations. These offer a reliable and scalable way to store and process large data sets, whether structured, semi-structured, or unstructured.


Getting Started with a Data Lake

Let’s explore how to get started with a data lake—beginning with ingesting data and following up with data governance, data prep, building and training the models, operationalizing the models, and reporting and analytics.


1. Ingest Data

Data comes from a variety of structured and unstructured data sources; therefore, a data lake must be able to ingest a broad variety of data such as batch, near-real-time, and real-time streaming. With a data lake, businesses no longer need to load and stage all their data into an EDW. Data of all kinds can be stored in the storage platform, which is lower cost and highly scalable.
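A minimal sketch of this landing step, with a local directory standing in for the object store that backs the lake (the layout and source names are illustrative): records from any source are appended to a raw zone, partitioned by source and date, without any upfront staging.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# A local temp directory stands in for the lake's object store.
lake_root = tempfile.mkdtemp()

def ingest(record, source):
    """Land a record in the raw zone, partitioned by source and date --
    a common layout for both batch and streaming ingestion."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = os.path.join(lake_root, "raw", source, day)
    os.makedirs(path, exist_ok=True)
    fname = os.path.join(path, "events.jsonl")
    with open(fname, "a") as f:
        f.write(json.dumps(record) + "\n")
    return fname

ingest({"sensor": "t-101", "temp": 21.7}, source="iot")   # streaming-style
ingest({"order": 5543, "total": 120.0}, source="erp")     # batch-style
```

Both record shapes land side by side; nothing had to be modeled before storage.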


2. Data Governance

Without quality data, the insights and actions of analytics and operational systems can be misguided or produce inaccurate results. A data lake must therefore be governed, so the data stays accurate and remains consistent. Another reason for data governance is to comply with applicable regulations and laws. This is especially important in highly regulated industries, such as financial services and healthcare.


Organizations need an integrated view of their data to ensure compliance with data privacy regulations. End-to-end visibility into data lineage is therefore critical to fully understand the business impact, interdependencies, duplications, and fragmentation. The Informatica Cloud Data Quality, Enterprise Data Catalog (EDC), and Axon Data Governance tech stack plays an integral role in this space. With these products, data can be automatically cataloged, classified, and enriched to give a complete view of the data with lineage.
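The essence of a data quality check can be sketched generically (this is an illustration, not how the Informatica products above work internally): declarative rules per field, applied to incoming records, yielding a pass ratio and the failing records for remediation.

```python
import re

# Illustrative quality rules -- one validity check per field.
RULES = {
    "email":  lambda v: v is not None
              and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def score_quality(records):
    """Return the fraction of records passing every rule, plus the failures."""
    failures = [r for r in records
                if not all(rule(r.get(field)) for field, rule in RULES.items())]
    passed = len(records) - len(failures)
    return passed / len(records), failures

rows = [
    {"email": "ana@example.com", "amount": 10.0},
    {"email": "broken-address",  "amount": 25.0},  # fails email validity
    {"email": "bo@example.com",  "amount": -3.0},  # fails range check
]
ratio, bad = score_quality(rows)
```

In practice such rules would be versioned and governed centrally, so analytics downstream can trust that "accurate and consistent" is continuously enforced rather than assumed.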


3. Data Prep

There are two approaches to preparing the data so it can be stored in a single repository: top-down or bottom-up. In a top-down approach, which is typical of an EDW, the schema is predefined based on known use cases, and the data is transformed and stored based on the predefined schema. This approach is suitable for getting information about what has happened in the past. In a bottom-up approach, which is typical for a data lake, there is no need to define the schema at the time of storage. Discovery can be done using raw data, or analysts can experiment to find patterns that are beneficial to the business. This is suitable for determining predictive patterns and gaining insight into what will happen in the future. Informatica’s Enterprise Data Catalog and Cloud Data Integration tech stack plays an integral role in this space, giving business users and data analysts the ability to explore, profile, and transform data dynamically and interactively.
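Bottom-up discovery typically starts with profiling: deriving structure from the raw records themselves rather than declaring it up front. A minimal sketch (field names are hypothetical) that infers, per field, the observed types and null counts:

```python
def profile(records):
    """Bottom-up discovery: infer, per field, the observed value types,
    null count, and occurrence count directly from raw records."""
    stats = {}
    for rec in records:
        for field, value in rec.items():
            s = stats.setdefault(field, {"types": set(), "nulls": 0, "seen": 0})
            s["seen"] += 1
            if value is None:
                s["nulls"] += 1
            else:
                s["types"].add(type(value).__name__)
    return stats

raw = [
    {"id": 1, "city": "Lyon", "spend": 40.5},
    {"id": 2, "city": None,   "spend": 12},
    {"id": 3, "city": "Oslo"},               # "spend" absent entirely
]
report = profile(raw)
```

The resulting report (mixed types on `spend`, nulls on `city`, a missing field) is exactly the kind of finding an analyst acts on before deciding how, or whether, to impose a schema.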


4. Build and Train the Data Models

These models are usually built using statistical modeling, machine learning (ML), or deep learning (DL) methods. A typical data scientist experiments with the data sets provided by the business analyst or data engineer to build these models. From there, the models are continuously trained using data that is being ingested into the data lake in real time or near real time. The models are trained and scored until they produce desirable results and the data scientist determines they have attained a certain maturity level. Open-source or third-party algorithms can be leveraged to score models in real-time or near-real-time scenarios to derive actionable insights and predict business outcomes.
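The train-score-until-mature loop can be sketched with a deliberately simple stand-in model, a one-feature least-squares fit, in place of a real ML/DL model (the data and the maturity threshold are invented for illustration):

```python
def fit(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def score(model, xs, ys):
    """Mean absolute error -- lower means a more mature model."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y)
               for x, y in zip(xs, ys)) / len(xs)

# Data already in the lake, plus records still arriving.
history_x, history_y = [1.0, 2.0], [2.1, 3.9]
incoming = [(3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

model, error = None, float("inf")
for x, y in incoming:
    history_x.append(x)               # fold newly ingested data in
    history_y.append(y)
    model = fit(history_x, history_y)
    error = score(model, history_x, history_y)
    if error < 0.2:                   # maturity threshold, chosen arbitrarily
        break
```

The structure, retrain on freshly ingested data and stop once the score crosses a threshold the data scientist deems acceptable, is the same whatever model sits inside the loop.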


5. Operationalize the Models

Once the ML/DL models have reached a certain maturity, they are deployed into production for ongoing insights and continuous scoring. Most data scientists don’t wait for 100% maturity to deploy; instead, deployment is an iterative process of continuous improvement. The models are deployed, scored, validated, and deployed again in a cyclical process to derive more accurate outcomes.
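One common shape for this cycle is a champion/challenger check (the models and holdout data below are toy stand-ins): a retrained candidate replaces the model in production only when it validates better on held-out data.

```python
def validate(model, holdout):
    """Mean absolute error on held-out data -- lower is better."""
    return sum(abs(model(x) - y) for x, y in holdout) / len(holdout)

holdout = [(1, 2.0), (2, 4.0), (3, 6.0)]

champion = lambda x: 1.8 * x      # model currently deployed in production
challenger = lambda x: 2.05 * x   # candidate from the latest retraining cycle

# Promote the challenger only if it scores better than the champion.
if validate(challenger, holdout) < validate(champion, holdout):
    champion = challenger         # redeploy: the cycle repeats from here
```

Each pass through the lake's fresh data produces a new challenger, so deployment becomes the repeated "deploy, score, validate, deploy again" loop the text describes rather than a one-time event.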


6. Reporting and Analytics

Operational reporting systems can continue to run on the EDW, using traditional tools like Cognos, Tableau, and Microsoft Power BI. But many organizations also want to derive real-time insights from the streaming or near-real-time data that is flowing into the data lake. These visualizations can easily be created with tools like Tableau and Power BI. A key benefit is the ability to correlate data between the two ecosystems to drive more insights and take preventive action. Increasingly, organizations are using analytics engines to unlock the full value of the data and drive actionable insights and further enrichment. There are several ways to build an analytics engine, but one best practice is to leverage a NoSQL database, such as Cassandra or MongoDB, to ingest and transform the processed data that is in the storage platform and/or in the EDW.
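The cross-ecosystem correlation can be sketched with an in-memory SQLite database standing in for the analytics engine (the tables and values are invented for illustration; a production build would use a store like Cassandra or MongoDB as the text notes): curated EDW records are joined against processed lake events in a single query.

```python
import sqlite3

# In-memory SQLite stands in for the analytics engine: one table of
# curated EDW records, one of processed events from the data lake.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edw_customers (id INTEGER, segment TEXT);
    CREATE TABLE lake_events (customer_id INTEGER, clicks INTEGER);
    INSERT INTO edw_customers VALUES (1, 'gold'), (2, 'silver');
    INSERT INTO lake_events VALUES (1, 12), (1, 3), (2, 7);
""")

# Correlate the two ecosystems: total clicks per customer segment.
rows = con.execute("""
    SELECT c.segment, SUM(e.clicks)
    FROM edw_customers c
    JOIN lake_events e ON e.customer_id = c.id
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()
```

The join is where the extra insight comes from: neither the EDW (segments, no behavior) nor the lake (behavior, no segments) could answer the question alone.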
