In today’s digital world, data drives business. Data unlocks insights that deliver memorable customer experiences, identify new revenue opportunities, proactively detect problems, and, of course, drive greater profitability.
Businesses are experiencing a data deluge – an unprecedented increase in the volume, variety, and velocity of data that must be analyzed to extract full business value. Data exists across many different silos, making it hard to analyze and draw conclusions. Traditional data warehouse platforms were designed for yesterday’s world, when data was predictable and structured.
A data lake changes that. A data lake gives an organization the luxury of pooling all of its data so that it is accessible to any user at any time for any type of analysis. A data lake empowers workers to do more of their own data discovery and investigation across massive amounts of data, without requiring intervention from the IT team. Data analysts and data scientists can easily search, manipulate, and combine datasets for use in their studies. Business managers can make strategic decisions using the right data at the right time. Industrial organizations can automate maintenance and operational tasks for production systems. With a data lake, IT can give business leaders the tools they need to make decisions, while adding value to their existing data warehouse.
Implementing a data lake consists of several key steps: data acquisition and storage, enterprise data cataloging, data governance and organization, data modification and manipulation, and finally, data operationalization and publication.
Huge quantities of data may be ingested on a regular basis from enterprise applications, sensors and other devices, external feeds, and other analytic systems. Data is created at different volumes (the amount of data), velocities (the speed of data), and varieties (a range of structured, unstructured, and streaming data). Different users may want to use the data in different ways. For example, analysts and data scientists may load their own data, such as spreadsheets or external datasets. IT is usually responsible for loading corporate data assets that are applicable to multiple users or that come from applications that are particularly complex to access.
A data lake can ingest data from internal and external sources, regardless of format (files, database tables, application objects, XML and JSON, and data from providers via web services). Data may be acquired in batch, near real-time, or real-time streaming modes.
Informatica Data Engineering Integration (DEI) can be used to access and integrate data; Informatica Data Engineering Quality (DEQ) to cleanse data; and Informatica Enterprise Data Catalog (EDC) or Informatica Axon Data Governance to catalog and govern data.
Informatica BDM delivers high-throughput ingestion and data integration. Hundreds of prebuilt connectors, data integration transformations, and parsers enable virtually any type of data to be ingested and processed. It also provides dynamic mappings, dynamic schema support, and programmatic automation of data integration.
Informatica BDQ enables organizations to enrich and standardize data at scale and proactively monitor the data quality process. It offers a set of role-based data discovery and profiling tools to quickly identify critical data problems hidden across the enterprise. It includes powerful tools for business analysts and developers alike.
An AI-powered data catalog, Informatica EDC provides a machine learning-based discovery engine to scan and catalog data assets across the enterprise. The CLAIRE™ engine leverages metadata to deliver intelligent recommendations, suggestions, and automation of data management tasks. It also provides data analysts and IT users with powerful semantic search. Axon Data Governance is integrated with Informatica Data Quality and Informatica Enterprise Data Catalog to enable a collaborative data governance program. Axon enables organizations to understand their data, measure and analyze their data, and connect easily across data governance stakeholders.
Informatica Data Engineering Streaming (DES) or Informatica Edge Data Streaming (EDS) can be used. Informatica DES provides prebuilt, high-performance connectors, such as Kafka, HDFS, Amazon Kinesis, NoSQL databases, and enterprise messaging systems, along with data transformations that enable a code-free method of defining your data integration logic. Informatica EDS collects and aggregates machine data, such as event logs, real-time application logs, call detail records, syslog sources, and HTTP sources, and that data can be streamed into the data lake. Apache Kafka also can be used for stream processing.
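To make the streaming ingestion pattern concrete, here is a minimal sketch of a consumer that lands raw events in the lake. It assumes the kafka-python client, a hypothetical topic named machine-logs, and an illustrative landing-zone path; it is a sketch of the general pattern, not the Informatica tooling described above.

```python
import json
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python

# Illustrative landing-zone location for raw machine data.
LANDING_ZONE = Path("/data/lake/landing/machine-logs")
LANDING_ZONE.mkdir(parents=True, exist_ok=True)

# Subscribe to a hypothetical topic of machine events.
consumer = KafkaConsumer(
    "machine-logs",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Append each event untouched as newline-delimited JSON so the raw record
# is preserved for later schema-on-read processing.
with open(LANDING_ZONE / "events.jsonl", "a", encoding="utf-8") as sink:
    for message in consumer:
        sink.write(json.dumps(message.value) + "\n")
```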
Informatica PowerExchange Change Data Capture (CDC) targets Kafka or MapR streams with real-time integration of PowerExchange-supported CDC sources, which then feed Informatica BDS or BDM. CDC enables delivery of up-to-the-minute information. It recognizes business events, such as customer creation or order shipment data, and the captured stream of database activity can be delivered to multiple targets in real time. Event-driven data can be transformed and cleaned continuously.
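The sketch below illustrates, in general terms, how a captured stream of change events can keep a target in sync. The event shape (an "op" of insert, update, or delete, a key, and the changed row) is an assumption for illustration, not the PowerExchange CDC record format.

```python
from typing import Any, Dict

def apply_cdc_event(target: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Keep an in-memory copy of a table in sync with a CDC stream."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]   # upsert the latest row image
    elif op == "delete":
        target.pop(key, None)        # drop the row if present

# Replay an illustrative stream of order events in capture order.
orders: Dict[str, Dict[str, Any]] = {}
stream = [
    {"op": "insert", "key": "1001", "row": {"status": "created"}},
    {"op": "update", "key": "1001", "row": {"status": "shipped"}},
]
for event in stream:
    apply_cdc_event(orders, event)
print(orders)  # {'1001': {'status': 'shipped'}}
```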
Informatica Vibe Data Stream for Machine Data (VDS) streams data at scale that is generated in the form of flat or JSON files, HTTP or WebSockets, TCP, UDP, syslog, or MQ Telemetry Transport (MQTT). VDS minimizes the need for hand coding and reduces the time to develop new stream processing applications.
Informatica DES, Informatica DEI, or Kafka also may be used for real-time streaming. DES can scale out horizontally and vertically to handle petabytes of data. It provides prebuilt, high-performance enterprise connectors and data transformations for a code-free way to define the data integration logic.
Informatica Enterprise Data Preparation (EDP) can be used to prepare and provision data. EDP enables raw big data to be systematically discovered so data analysts can find the information they’re looking for through semantic and faceted search, while automatically understanding data lineage and data relationships. An intuitive interface and built-in transformations make it easy to filter, aggregate, merge, and combine data. Enterprise Data Catalog (EDC) and Axon provide, respectively, the operational metadata and business vocabulary that EDP needs to prepare recipes and datasets.
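For readers who prefer code, here is a minimal sketch of the filter, aggregate, and merge steps described above, using pandas with hypothetical column names; in EDP the same preparation is done through its interface rather than in code.

```python
import pandas as pd

# Illustrative datasets standing in for tables discovered in the lake.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 80.0, 200.0, 50.0],
    "status": ["shipped", "shipped", "returned", "shipped"],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EMEA", "APAC", "AMER"],
})

# Filter, aggregate, then combine with a second dataset.
shipped = orders[orders["status"] == "shipped"]
revenue = shipped.groupby("customer_id", as_index=False)["amount"].sum()
prepared = revenue.merge(customers, on="customer_id", how="left")
print(prepared)
```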
Hadoop is most commonly used for the data store, because it easily stores unstructured, semi-structured, and structured data. Hadoop is key for data lakes because it supports schema on-read, which means a schema is applied to the data as it is pulled out of storage, rather than as it is written, as older database technologies required. Schema on-read enables greater versatility in organizing data.
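The following is a minimal schema-on-read sketch in PySpark: raw JSON files sit in the lake as-is, and the reader declares the schema only at query time. The path and field names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is declared by the reader, not enforced when the files were written.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("recorded_at", StringType()),
])

# Apply the schema while reading raw JSON from an illustrative landing path.
readings = spark.read.schema(schema).json("/data/lake/landing/sensor-readings/")
readings.show(5)
```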
Commercial HDFS products are available from Hortonworks, Cloudera, MapR, and others. Organizations may choose to implement Hadoop in their private data centers, or leverage cloud services such as Amazon S3 or Microsoft Azure Data Lake.
Several processing frameworks can be used. MapReduce and Spark are used for general-purpose batch processing. With different frameworks, data can be processed at higher levels of abstraction, for instance using SQL, for machine learning, or in real time.
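As one example of processing at a higher level of abstraction, the sketch below runs a SQL query over lake data with Spark SQL; the path, table name, and query are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-lake").getOrCreate()

# Read raw JSON from an illustrative landing path and expose it to SQL.
readings = spark.read.json("/data/lake/landing/sensor-readings/")
readings.createOrReplaceTempView("readings")

# Query the lake with ordinary SQL instead of hand-written MapReduce code.
hottest = spark.sql("""
    SELECT device_id, MAX(temperature) AS peak_temperature
    FROM readings
    GROUP BY device_id
    ORDER BY peak_temperature DESC
""")
hottest.show(10)
```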
Organizations use a data catalog to discover and inventory corporate data assets across their enterprise. A data catalog essentially becomes a clearinghouse for all relevant information about corporate data assets. It maximizes the value and reuse of data across the enterprise.
An enterprise data catalog should automatically catalog and classify all types of data across the enterprise, giving organizations a complete picture of their data, including lineage, relationship view, data profiling, and quality.
Key steps in building a data catalog include:
Informatica EDC is an AI-powered data catalog that automatically scans and catalogs data across the enterprise, indexing it for enterprise-wide discovery. It maintains end-to-end data lineage, so there’s a complete tracking of data movement to ensure data quality and protection. It leverages metadata to deliver intelligent recommendations and automate data management tasks. Intuitive search capabilities allow data assets to be found quickly and easily.
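To show the kind of metadata a catalog records for each asset, here is a minimal sketch using a plain dataclass; the fields are illustrative and are not the EDC data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str                      # business-friendly asset name
    location: str                  # where the asset physically lives
    owner: str                     # accountable data steward
    classification: str            # e.g. "public", "confidential", "PII"
    lineage: List[str] = field(default_factory=list)  # upstream sources
    tags: List[str] = field(default_factory=list)     # search keywords

entry = CatalogEntry(
    name="customer_orders",
    location="/data/lake/curated/orders/",
    owner="sales-analytics",
    classification="confidential",
    lineage=["/data/lake/landing/orders/", "crm.orders"],
    tags=["orders", "revenue"],
)
print(entry.name, entry.classification)
```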
Informatica Dynamic Data Masking de-identifies data and controls access to customer service, billing, order management, customer engagement, and other applications. It masks or blocks sensitive information to users based on their role, location and privileges to prevent unauthorized access to sensitive information.
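The general idea of role-based masking can be sketched as follows; the roles and masking rule are illustrative assumptions and do not reflect how Informatica Dynamic Data Masking is actually configured.

```python
# Roles allowed to see the unmasked value (illustrative).
PRIVILEGED_ROLES = {"fraud_analyst", "billing_admin"}

def mask_card_number(card_number: str, role: str) -> str:
    """Return the full value only to privileged roles; mask it otherwise."""
    if role in PRIVILEGED_ROLES:
        return card_number
    return "*" * (len(card_number) - 4) + card_number[-4:]

print(mask_card_number("4111111111111111", "support_agent"))  # ************1111
print(mask_card_number("4111111111111111", "billing_admin"))  # 4111111111111111
```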
Data governance empowers the business to use its data assets while ensuring that data is used in accordance with corporate and regulatory data privacy mandates. Governance starts with cataloging key data assets and understanding the data dictionary. This gives an organization a complete view of all the information for audit purposes, and makes it possible to start setting data policies for how the data will be used in the data lake.
Data governance requires:
Be diligent about policy-based data governance. An undocumented and disorganized data store is nearly impossible to trust or use.
Now it’s time to consider how to organize, modify, and manipulate data.
Organize Data
A data lake is typically organized into zones for different uses:
Data can be further segregated into subzones based on business need, for example to separate data by business unit, or to keep certain data more secure to meet specific compliance or data privacy requirements.
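As an illustration only, a zoned layout is often expressed as a simple path convention; the zone names and paths below are assumptions, not a product default.

```python
# Illustrative zone-to-path conventions for a data lake.
ZONES = {
    "landing":     "/data/lake/landing/{source}/",         # raw data as ingested
    "curated":     "/data/lake/curated/{business_unit}/",  # cleansed and conformed
    "consumption": "/data/lake/consumption/{project}/",    # published for analytics
}

def zone_path(zone: str, **parts: str) -> str:
    """Resolve a concrete path inside a zone, e.g. for one business unit."""
    return ZONES[zone].format(**parts)

print(zone_path("curated", business_unit="finance"))
# /data/lake/curated/finance/
```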
Informatica DEI or Informatica CDC may be used in the landing zone. Informatica DEI can be used to access, integrate, clean, catalog, and govern data. Prebuilt connectors, data integration transformations, and parsers enable virtually any type of data to be ingested and processed. It also provides dynamic mappings, dynamic schema support, and programmatic automation of data integration.
Informatica CDC targets Kafka or MapR streams with real-time integration of PowerExchange-supported CDC sources, which then feed Informatica DES or DEI. CDC enables delivery of up-to-the-minute information.
Modify and Manipulate Data
There are several key steps for data modification and manipulation:
Data lakes are most commonly used for next-generation analytics and predictive decision modeling. The next step is to build the data models for analytics.
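As a small, self-contained illustration of the predictive modeling step, the sketch below fits a classifier on a hypothetical prepared dataset with scikit-learn; the features, labels, and model choice are assumptions for illustration, not a production pipeline.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical features (order count, days since last order) and churn labels.
X = [[12, 3], [1, 200], [8, 14], [2, 150], [15, 2], [3, 90]]
y = [0, 1, 0, 1, 0, 1]

# Hold out a slice of the prepared data to check the model before publishing it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```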
Once the data models are built, the data lake can be operationalized. In this step, the data and operations are handed to the IT team, who ensure the datasets are available in an optimized, scalable way. After it is refined and ready to use, the data is published for consumers such as reporting tools, applications, and other business processes.
Data can be processed in multiple ways:
Recipes in Informatica EDP allow you to easily publish and operationalize data. The steps for data preparation are recorded in recipes, which can automatically generate data flows that can then be scheduled to operationalize insights.
Informatica BDM can be used to auto-generate mappings for data on-premises or in the cloud. The Informatica mapping can be processed by Blaze, Informatica’s native data management engine that runs as a YARN-based application. Blaze provides intelligent data pipelining, job partitioning, job recovery, and scaling. In Spark mode, mappings are translated into Scala code so they can be executed natively on the Hadoop cluster; a Hive on MapReduce execution mode is also available.
Analysts use third-party tools to visualize the analytics. These may include structured reports, dashboards, geospatial or semantic displays of information, or simulations leading to actionable insights. Visualization techniques offer rich interaction so that analysts can drill down and break the data into smaller pieces.