Last Updated: Aug 18, 2022

Business Goal

With the explosion of the digital economy, data unlocks insights that drive greater profitability, uncover new opportunities, detect problems proactively, accelerate product and service innovation, and deliver exceptional customer experiences. Retailers use data to analyze customer behavior and personalize the shopping experience. Manufacturers use data to fine-tune their global supply chains. Healthcare providers leverage data to care for patients better and at lower cost. Utilities and telecom companies use data to inform predictive maintenance and avoid service outages.

To meet ever-changing and aggressive business demands and maintain a competitive advantage, organizations need to invest in processes that let them analyze their historical, batch, and real-time data against models that can predict business outcomes and trigger preventive measures.

But accessing and leveraging data quickly and easily is getting harder. As organizations automate and digitize more of their processes and products, they generate more data. Massive amounts of data sit locked in the many different applications and systems that span an organization's IT infrastructure, in the cloud and in private data centers. As the volume grows, it becomes ever more difficult for organizations to find and use this data to gain insights and attain business value.

Organizations have traditionally used enterprise data warehouses (EDWs) to analyze, report on, and store operational data, but the traditional EDW is limited in today's real-time, data-driven world. An EDW is optimized to store structured, modeled relational data coming from transactional systems and line-of-business applications. It is not designed to deal with the volume and variety of both unstructured non-relational data and structured data available in today's organizations. This limitation has spurred the innovation and development of new technologies that can handle data of any type, of any size, and at great speed. Implementations of these new technologies are generally referred to as data lakes.

A data lake enables organizations to store all structured and unstructured data efficiently, cost-effectively, and at any scale. In addition to storing relational data from line-of-business and transactional applications, a data lake can store unstructured and real-time data from mobile apps, IoT devices, sensors, and social media.

Data lakes enable new and innovative data analysis strategies such as schema on read: the capability to store data without first defining its structure. By not forcing the data into a predetermined model, we retain more information that can be used to answer future questions in greater detail and with more confidence. The data lake ecosystem offers a broad variety of tools for analytics, from dashboards and visualizations to SQL queries, big-data processing, full-text search, real-time analytics, and machine learning, to guide data-driven decision-making or to take action.
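
To make schema on read concrete, here is a minimal sketch using PySpark, assuming raw JSON events have already landed in the lake; the path and field names are hypothetical. The structure is inferred only when the data is read, not when it is stored.

```python
# Schema-on-read sketch (PySpark); the path and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw JSON events were stored as-is, with no predefined schema.
events = spark.read.json("s3://example-lake/raw/clickstream/2022/08/")

# The schema is inferred at read time, so new questions can be asked later
# without having reshaped the data on ingest.
events.printSchema()
events.filter(events.event_type == "purchase").groupBy("country").count().show()
```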

Data lake storage architectures are at the heart of many data lake implementations. They offer a reliable and scalable way to store and process large data sets, whether structured, semi-structured, or unstructured.

Getting Started with a Data Lake

Let’s explore how to get started with a data lake—beginning with ingesting data and following up with data governance, data prep, building and training the models, operationalizing the models, and reporting and analytics.

1. Ingest Data

Data comes from a variety of structured and unstructured sources; therefore, a data lake must be able to ingest data in a broad variety of modes, such as batch, near-real-time, and real-time streaming. With a data lake, businesses no longer have to load and stage all their data in an EDW. Data of all kinds can be stored in the lower-cost, highly scalable storage platform, and offloading it from the EDW also improves query performance there.
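
As an illustration of batch ingestion, the sketch below lands a raw extract in an S3-compatible object store using boto3; the bucket, file, and key names are placeholders.

```python
# Batch ingestion sketch: land a raw file in the lake's object store as-is.
# Bucket, file, and key names are placeholders; assumes an S3-compatible store.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="daily_orders_2022-08-18.csv",        # raw extract from a source system
    Bucket="example-data-lake",
    Key="raw/orders/2022/08/18/daily_orders.csv",  # partitioned by date for later pruning
)
```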

2. Data Governance

Without quality data, the insights and actions of analytics and operational systems can be misguided or produce inaccurate results. A data lake must therefore be governed, so the data stays accurate and remains consistent. Another reason for data governance is to comply with applicable regulations and laws. This is especially important in highly regulated industries, such as financial services and healthcare.

Organizations need an integrated view of their data to ensure compliance with data privacy regulations. End-to-end visibility into data lineage is therefore critical to fully understand business impact, interdependencies, duplication, and fragmentation. The Informatica Cloud Data Quality, Enterprise Data Catalog (EDC), and Axon Data Governance tech stack plays an integral role in this space: with these products, data can be automatically cataloged, classified, and enriched to gain a complete view of the data along with its lineage.
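
The Informatica products above provide cataloging, classification, and quality rules out of the box; as a neutral sketch of the underlying idea, the snippet below applies two simple data-quality rules in pandas. The column names and thresholds are invented for illustration.

```python
# Data-quality rule sketch in pandas (column names and thresholds are invented).
import pandas as pd

df = pd.read_csv("raw/orders/2022/08/18/daily_orders.csv")

# Rule 1: completeness -- key fields must not be null.
null_rate = df["customer_id"].isna().mean()
assert null_rate < 0.01, f"customer_id null rate {null_rate:.2%} exceeds 1% threshold"

# Rule 2: validity -- order amounts must be non-negative.
invalid = df[df["order_amount"] < 0]
assert invalid.empty, f"{len(invalid)} rows have negative order_amount"
```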

3. Data Prep

There are two approaches to preparing data for storage in a single repository: top-down or bottom-up. In a top-down approach, which is typical of an EDW, the schema is predefined based on known use cases, and the data is transformed and stored according to that schema. This approach is suitable for learning what has happened in the past. In a bottom-up approach, which is typical of a data lake, there is no need to define the schema at storage time. Discovery can be done on the raw data, and analysts can experiment to find patterns that benefit the business. This is suitable for determining predictive patterns and gaining insight into what will happen in the future. Informatica's Enterprise Data Catalog and Cloud Data Integration tech stack plays an integral role in this space, giving business users and data analysts the ability to explore, profile, and transform data dynamically and interactively.
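
The kind of interactive profiling and transformation these tools provide can be approximated in code. The sketch below, with hypothetical file and column names, profiles a raw file and derives a cleaned, analysis-ready copy while leaving the raw data untouched, in keeping with the bottom-up approach.

```python
# Data-prep sketch: profile raw data, then derive a cleaned view (bottom-up).
import pandas as pd

raw = pd.read_csv("raw/orders/2022/08/18/daily_orders.csv")

# Profile first: shape, types, null counts, and basic statistics.
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())
print(raw.describe(include="all"))

# Transform only what this analysis needs; the raw file stays untouched in the lake.
prepared = (
    raw.dropna(subset=["customer_id"])
       .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
)
prepared.to_parquet("curated/orders/2022/08/18/orders.parquet")
```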

4. Build and Train the Data Models

These models are usually built with statistical modeling, machine learning (ML), or deep learning (DL) methods. A typical data scientist experiments with the data sets provided by the business analyst or data engineer to build these models. From there, the models are continuously trained on data being ingested into the data lake in real time or near real time. The models are trained and scored until they produce desirable results and the data scientist determines they have attained a certain maturity level. Open-source or third-party algorithms can be leveraged to score models in real-time or near-real-time scenarios to derive actionable insights and predict business outcomes.
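
A typical experiment loop looks like the sketch below, here using scikit-learn with hypothetical feature and label names; statistical, ML, and DL frameworks all follow the same train-then-score pattern.

```python
# Model-training sketch (scikit-learn); feature and label names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = pd.read_parquet("curated/orders/2022/08/18/orders.parquet")
X = data[["order_amount", "days_since_last_order"]]
y = data["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Score until results are acceptable; in practice this loop repeats as new data arrives.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```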

5. Operationalize the Models

Once the ML/DL models have reached a certain maturity, they are deployed into production for ongoing insights and continuous scoring. Most data scientists don't wait for 100% maturity to deploy; instead, deployment is an iterative process of continuous improvement. The models are deployed, scored, validated, and deployed again in a cyclical process to derive ever more accurate outcomes.
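
One common way to operationalize such a model is to persist it and score new records as they arrive. The sketch below assumes the model from step 4 was serialized with joblib; the paths and column names are hypothetical.

```python
# Serving sketch: load a previously trained model and score a new batch.
# Paths and column names are hypothetical.
import joblib
import pandas as pd

scorer = joblib.load("models/churn_model_v3.joblib")  # model persisted after step 4

new_batch = pd.read_parquet("curated/orders/2022/08/19/orders.parquet")
new_batch["churn_score"] = scorer.predict_proba(
    new_batch[["order_amount", "days_since_last_order"]]
)[:, 1]

# Scores feed back into validation, so the deploy-score-validate cycle can repeat.
new_batch.to_parquet("curated/scores/2022/08/19/churn_scores.parquet")
```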

6. Reporting and Analytics

Operational reporting systems can continue to run on the EDW, using traditional tools like Cognos, Tableau, and Microsoft Power BI. But many organizations also want to derive real-time insights from the streaming or near-real-time data flowing into the data lake. These visualizations can easily be created with tools like Tableau and Power BI. A key benefit is the ability to correlate data between the two ecosystems to drive more insights and take preventive action. Increasingly, organizations are using analytics engines to unlock the full value of the data and drive actionable insights and further enrichment. There are several ways to build an analytics engine, but one best practice is to leverage a NoSQL database, such as Cassandra or MongoDB, to ingest and transform the processed data residing in the storage platform and/or the EDW.
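
As one concrete version of that pattern, the sketch below pushes processed records from the lake into MongoDB with pymongo so downstream dashboards and APIs can query them; the connection string, database, and collection names are placeholders.

```python
# Analytics-engine sketch: load processed lake data into MongoDB (pymongo).
# Connection string, database, and collection names are placeholders.
import pandas as pd
from pymongo import MongoClient

scores = pd.read_parquet("curated/scores/2022/08/19/churn_scores.parquet")

client = MongoClient("mongodb://localhost:27017")
collection = client["analytics"]["churn_scores"]

# Insert documents so BI tools and APIs can query low-latency, pre-scored views.
collection.insert_many(scores.to_dict(orient="records"))
```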
