In today’s digital world, data drives business. Data unlocks insights that deliver memorable customer experiences, identify new revenue opportunities, proactively detect problems, and, of course, drive greater profitability.
Businesses are experiencing a data deluge – an unprecedented increase in the volume, variety, and velocity of data that must be analyzed to extract full business value. Data exists across many different silos, making it hard to analyze and draw conclusions. Traditional data warehouse platforms were designed for yesterday’s world, when data was predictable and structured.
A data lake changes that. A data lake gives an organization the luxury of pooling all of its data so that it is accessible to any user at any time for any type of analysis. A data lake empowers workers to do more of their own data discovery and investigation across massive amounts of data, without requiring intervention from the IT team. Data analysts and data scientists can easily search, manipulate, and combine datasets for use in their studies. Business managers can make strategic decisions using the right data at the right time. Industrial organizations can automate maintenance and operational tasks for production systems. With a data lake, IT can give business leaders the tools they need to make decisions, while adding value to their existing data warehouse.
Implementing a data lake consists of several key steps: data acquisition and storage, enterprise data cataloging, data governance and organization, data modification and manipulation, and finally, data operationalization and publication.
Huge quantities of data may be ingested on a regular basis from enterprise applications, sensors and other devices, external feeds, and other analytic systems. Data is created at different volumes (the amount of data), velocities (the speed of data), and varieties (a range of structured, unstructured, and streaming data). Different users may want to use the data in different ways. For example, analysts and data scientists may load their own data, such as spreadsheets or external datasets. IT is usually responsible for loading corporate data assets that are applicable to multiple users or that come from applications that are particularly complex to access.
A data lake can ingest data from internal and external sources, regardless of format (files, database tables, application objects, XML and JSON, and data from providers via web services). Data may be acquired in batch, near real-time, or real-time streaming modes.
Informatica Data Engineering Integration (DEI) can be used to access and integrate data; Informatica Data Engineering Quality (DEQ) to cleanse data; and Informatica Enterprise Data Catalog (EDC) or Informatica Axon Data Governance to catalog and govern data.
Informatica BDM delivers high-throughput ingestion and data integration. Hundreds of prebuilt connectors, data integration transformations, and parsers enable virtually any type of data to be ingested and processed. It also provides dynamic mappings, dynamic schema support, and programmatic automation of data integration.
Informatica BDQ enables organizations to enrich and standardize data at scale and proactively monitor the data quality process. It offers a set of role-based data discovery and profiling tools to quickly identify critical data problems hidden across the enterprise. It includes powerful tools for business analysts and developers alike.
An AI-powered data catalog, Informatica EDC provides a machine learning-based discovery engine to scan and catalog data assets across the enterprise. The CLAIRE™ engine leverages metadata to deliver intelligent recommendations, suggestions, and automation of data management tasks. It also provides data analysts and IT users with powerful semantic search. Axon Data Governance is integrated with Informatica Data Quality and Informatica Enterprise Data Catalog to enable a collaborative data governance program. Axon enables organizations to understand their data, measure and analyze their data, and connect easily across data governance stakeholders.
Informatica Data Engineering Streaming (DES) or Informatica Edge Data Streaming (EDS) can be used. Informatica DES provides prebuilt, high-performance connectors, such as Kafka, HDFS, Amazon Kinesis, NoSQL databases, and enterprise messaging systems, along with data transformations that enable a code-free method of defining your data integration logic. Informatica EDS collects and aggregates machine data, such as event logs, real-time application logs, call detail records, syslog sources, and HTTP sources, and that data can be streamed into the data lake. Apache Kafka also can be used for stream processing.
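To make the streaming ingestion pattern concrete, here is a minimal sketch of a consumer that lands raw events in the lake. It assumes the kafka-python client, a hypothetical topic named machine-logs, and an illustrative landing-zone path; it is a sketch of the general pattern, not the Informatica tooling described above.

```python
import json
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python

# Illustrative landing-zone location for raw machine data.
LANDING_ZONE = Path("/data/lake/landing/machine-logs")
LANDING_ZONE.mkdir(parents=True, exist_ok=True)

# Subscribe to a hypothetical topic of machine events.
consumer = KafkaConsumer(
    "machine-logs",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Append each event untouched as newline-delimited JSON so the raw record
# is preserved for later schema-on-read processing.
with open(LANDING_ZONE / "events.jsonl", "a", encoding="utf-8") as sink:
    for message in consumer:
        sink.write(json.dumps(message.value) + "\n")
```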
Informatica PowerExchange Change Data Capture (CDC) targets Kafka or MapR streams with real-time integration of PowerExchange-supported CDC sources, which then feed Informatica BDS or BDM. CDC enables delivery of up-to-the-minute information. It recognizes business events, such as customer creation or order shipment data, and the captured stream of database activity can be delivered to multiple targets in real time. Event-driven data can be transformed and cleaned continuously.
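The sketch below illustrates, in general terms, how a captured stream of change events can keep a target in sync. The event shape (an "op" of insert, update, or delete, a key, and the changed row) is an assumption for illustration, not the PowerExchange CDC record format.

```python
from typing import Any, Dict

def apply_cdc_event(target: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Keep an in-memory copy of a table in sync with a CDC stream."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]   # upsert the latest row image
    elif op == "delete":
        target.pop(key, None)        # drop the row if present

# Replay an illustrative stream of order events in capture order.
orders: Dict[str, Dict[str, Any]] = {}
stream = [
    {"op": "insert", "key": "1001", "row": {"status": "created"}},
    {"op": "update", "key": "1001", "row": {"status": "shipped"}},
]
for event in stream:
    apply_cdc_event(orders, event)
print(orders)  # {'1001': {'status': 'shipped'}}
```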
Informatica Vibe Data Stream for Machine Data (VDS) streams data at scale that is generated in the form of flat or JSON files, HTTP or WebSockets, TCP, UDP, syslog, or MQ Telemetry Transport (MQTT). VDS minimizes the need for hand coding and reduces the time to develop new stream processing applications.
Informatica DES, Informatica DEI, or Kafka also may be used for real-time streaming. DES can scale out horizontally and vertically to handle petabytes of data. It provides prebuilt, high-performance enterprise connectors and data transformations for a code-free way to define the data integration logic.
Informatica Enterprise Data Preparation (EDP) can be used to prepare and provision data. EDP enables raw big data to be systematically discovered so data analysts can find the information they’re looking for through semantic and faceted search, while automatically understanding data lineage and data relationships. An intuitive interface and built-in transformations make it easy to filter, aggregate, merge, and combine data. Enterprise Data Catalog (EDC) and Axon provide, respectively, the operational metadata and business vocabulary that EDP needs to prepare recipes and datasets.
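For readers who prefer code, here is a minimal sketch of the filter, aggregate, and merge steps described above, using pandas with hypothetical column names; in EDP the same preparation is done through its interface rather than in code.

```python
import pandas as pd

# Illustrative datasets standing in for tables discovered in the lake.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 80.0, 200.0, 50.0],
    "status": ["shipped", "shipped", "returned", "shipped"],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EMEA", "APAC", "AMER"],
})

# Filter, aggregate, then combine with a second dataset.
shipped = orders[orders["status"] == "shipped"]
revenue = shipped.groupby("customer_id", as_index=False)["amount"].sum()
prepared = revenue.merge(customers, on="customer_id", how="left")
print(prepared)
```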
Hadoop is most commonly used for the data store, because it easily stores unstructured, semi-structured, and structured data. Hadoop is key for data lakes because it supports schema on-read, which means a schema is applied to the data as it is pulled out of storage, rather than as it is written, as older database technologies required. Schema on-read enables greater versatility in organizing data.
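The following is a minimal schema-on-read sketch in PySpark: raw JSON files sit in the lake as-is, and the reader declares the schema only at query time. The path and field names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is declared by the reader, not enforced when the files were written.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("recorded_at", StringType()),
])

# Apply the schema while reading raw JSON from an illustrative landing path.
readings = spark.read.schema(schema).json("/data/lake/landing/sensor-readings/")
readings.show(5)
```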
Commercial HDFS products are available from Hortonworks, Cloudera, MapR, and others. Organizations may choose to implement Hadoop in their private data centers, or leverage cloud services such as Amazon S3 or Microsoft Azure Data Lake.
Several processing frameworks can be used. MapReduce and Spark are used for general-purpose batch processing. With different frameworks, data can be processed at higher levels of abstraction, for instance using SQL, for machine learning, or in real time.
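As one example of processing at a higher level of abstraction, the sketch below runs a SQL query over lake data with Spark SQL; the path, table name, and query are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-lake").getOrCreate()

# Read raw JSON from an illustrative landing path and expose it to SQL.
readings = spark.read.json("/data/lake/landing/sensor-readings/")
readings.createOrReplaceTempView("readings")

# Query the lake with ordinary SQL instead of hand-written MapReduce code.
hottest = spark.sql("""
    SELECT device_id, MAX(temperature) AS peak_temperature
    FROM readings
    GROUP BY device_id
    ORDER BY peak_temperature DESC
""")
hottest.show(10)
```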
Organizations use a data catalog to discover and inventory corporate data assets across their enterprise. A data catalog essentially becomes a clearinghouse for all relevant information about corporate data assets. It maximizes the value and reuse of data across the enterprise.
An enterprise data catalog should automatically catalog and classify all types of data across the enterprise, giving organizations a complete picture of their data, including lineage, relationship view, data profiling, and quality.
Key steps in building a data catalog include:
Informatica EDC is an AI-powered data catalog that automatically scans and catalogs data across the enterprise, indexing it for enterprise-wide discovery. It maintains end-to-end data lineage, so there’s a complete tracking of data movement to ensure data quality and protection. It leverages metadata to deliver intelligent recommendations and automate data management tasks. Intuitive search capabilities allow data assets to be found quickly and easily.
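To show the kind of metadata a catalog records for each asset, here is a minimal sketch using a plain dataclass; the fields are illustrative and are not the EDC data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str                      # business-friendly asset name
    location: str                  # where the asset physically lives
    owner: str                     # accountable data steward
    classification: str            # e.g. "public", "confidential", "PII"
    lineage: List[str] = field(default_factory=list)  # upstream sources
    tags: List[str] = field(default_factory=list)     # search keywords

entry = CatalogEntry(
    name="customer_orders",
    location="/data/lake/curated/orders/",
    owner="sales-analytics",
    classification="confidential",
    lineage=["/data/lake/landing/orders/", "crm.orders"],
    tags=["orders", "revenue"],
)
print(entry.name, entry.classification)
```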
Informatica Dynamic Data Masking de-identifies data and controls access to customer service, billing, order management, customer engagement, and other applications. It masks or blocks sensitive information to users based on their role, location and privileges to prevent unauthorized access to sensitive information.
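The general idea of role-based masking can be sketched as follows; the roles and masking rule are illustrative assumptions and do not reflect how Informatica Dynamic Data Masking is actually configured.

```python
# Roles allowed to see the unmasked value (illustrative).
PRIVILEGED_ROLES = {"fraud_analyst", "billing_admin"}

def mask_card_number(card_number: str, role: str) -> str:
    """Return the full value only to privileged roles; mask it otherwise."""
    if role in PRIVILEGED_ROLES:
        return card_number
    return "*" * (len(card_number) - 4) + card_number[-4:]

print(mask_card_number("4111111111111111", "support_agent"))  # ************1111
print(mask_card_number("4111111111111111", "billing_admin"))  # 4111111111111111
```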
Data governance empowers the business to use its data assets while ensuring that data is used in accordance with corporate and regulatory data privacy mandates. Governance starts with cataloging key data assets and understanding the data dictionary. This gives an organization a complete view of all the information for audit purposes, and makes it possible to start setting data policies for how the data will be used in the data lake.
Data governance requires:
Be diligent about policy-based data governance. An undocumented and disorganized data store is nearly impossible to trust or use.
Now it’s time to consider how to organize, modify, and manipulate data.
Organize Data
A data lake is typically organized into zones for different uses:
Data can be further segregated into subzones based on business need, for example to separate data by business unit, or to keep certain data more secure to meet specific compliance or data privacy requirements.
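As an illustration only, a zoned layout is often expressed as a simple path convention; the zone names and paths below are assumptions, not a product default.

```python
# Illustrative zone-to-path conventions for a data lake.
ZONES = {
    "landing":     "/data/lake/landing/{source}/",         # raw data as ingested
    "curated":     "/data/lake/curated/{business_unit}/",  # cleansed and conformed
    "consumption": "/data/lake/consumption/{project}/",    # published for analytics
}

def zone_path(zone: str, **parts: str) -> str:
    """Resolve a concrete path inside a zone, e.g. for one business unit."""
    return ZONES[zone].format(**parts)

print(zone_path("curated", business_unit="finance"))
# /data/lake/curated/finance/
```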
Informatica DEI or Informatica CDC may be used in the landing zone. Informatica DEI can be used to access, integrate, clean, catalog, and govern data. Prebuilt connectors, data integration transformations, and parsers enable virtually any type of data to be ingested and processed. It also provides dynamic mappings, dynamic schema support, and programmatic automation of data integration.
Informatica CDC targets Kafka or MapR streams with real-time integration of PowerExchange-supported CDC sources, which then feed Informatica DES or DEI. CDC enables delivery of up-to-the-minute information.
Modify and Manipulate Data
There are several key steps for data modification and manipulation:
Data lakes are most commonly used for next-generation analytics and predictive decision modeling. The next step is to build the data models for analytics.
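As a small, self-contained illustration of the predictive modeling step, the sketch below fits a classifier on a hypothetical prepared dataset with scikit-learn; the features, labels, and model choice are assumptions for illustration, not a production pipeline.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical features (order count, days since last order) and churn labels.
X = [[12, 3], [1, 200], [8, 14], [2, 150], [15, 2], [3, 90]]
y = [0, 1, 0, 1, 0, 1]

# Hold out a slice of the prepared data to check the model before publishing it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```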
Once the data models are built, the data lake can be operationalized. In this step, the data and operations are handed to the IT team, who ensure the datasets are available in an optimized, scalable way. After it is refined and ready to use, the data is published for consumers such as reporting tools, applications, and other business processes.
Data can be processed in multiple ways:
Recipes in Informatica EDP allow you to easily publish and operationalize data. The steps for data preparation are recorded in recipes, which can automatically generate data flows that can then be scheduled to operationalize insights.
Informatica BDM can be used to auto-generate mappings for data on-premises or in the cloud. The Informatica mapping can be processed by Blaze, Informatica’s native data management engine that runs as a YARN-based application. Blaze provides intelligent data pipelining, job partitioning, job recovery, and scaling. In Spark mode, mappings are translated into Scala code so they can be executed natively on the Hadoop cluster; a Hive on MapReduce execution mode is also available.
Analysts use third-party tools to visualize the analytics. These may include structured reports, dashboards, geospatial or semantic displays of information, or simulations leading to actionable insights. Visualization techniques offer rich interaction so that analysts can drill down and break the data into smaller pieces.