Informatica Data Engineering Integration (Big Data Management) delivers high-throughput data ingestion and data integration processing so business analysts can get the data they need quickly. Hundreds of prebuilt, high-performance connectors, data integration transformations, and parsers enable virtually any type of data to be quickly ingested and processed on big data infrastructures, such as Apache Hadoop, NoSQL, and MPP appliances. Beyond prebuilt components and automation, Informatica provides dynamic mappings, dynamic schema support, and parameterization for programmatic and templatized automation of data integration processes.
The Intermediate level of product learning consists of many 'how-to' documents, whitepapers, and videos that help administrators and developers adopt DEI. It also discusses Data Engineering Streaming (Big Data Streaming), renaming nodes and domains, security, impersonation, authentication, and working with other ecosystems like Sqoop, Avro, Hive, Teradata, and Hadoop.
After you successfully complete all three levels of Data Engineering Integration (Big Data Management) product learning, you will earn an Informatica Badge for Data Engineering Integration (Big Data Management). So let's continue on your learning journey!
This module covered administrative tasks such as installing and configuring the distribution package, setting up security mechanisms, and tuning performance. It showed you how to change the name of an existing domain, back up an existing node, create a new node, and export and import domain metadata. It also showed you how to reuse existing PowerCenter applications for big data workloads, generate a report to assess the effort involved in moving to big data, and use the import utility to import PowerCenter mappings.
This module also discussed security for both authentication and authorization, including Kerberos-authenticated environments and permissions and privileges on Ranger, as well as the principles, installation, and operations of Kafka. You then learned about deploying a Kafka cluster and about the basic DEI components: Spark SQL, Hive Query Language, Sqoop, Hive, Teradata connector integration, and Oracle Sqoop connectivity.
You also learned how cluster configuration privileges and roles determine the actions that users can perform, explored the cluster configuration views, integrated DEI with a Hadoop cluster, and ran Hadoop commands from a DIS node that is not an edge node.
Now move on to the Advanced level for your DEI onboarding and get to know more about the product.
Here is a video that introduces the basic concepts of Spark SQL. The video covers the following topics:
- Spark - Introduction
- Spark - RDD
- Spark - Installation
- Spark SQL - Introduction
- Spark SQL - DataFrames
- Spark SQL - Data Sources
This video discusses the basic concepts of Hive Query Language (HiveQL):
- HiveQL - Select Where
- HiveQL - Select Order By
- HiveQL - Select Group By
- HiveQL - Select Joins
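The four query shapes listed above can be tried out without a Hive cluster. The sketch below is a stand-in, not HiveQL itself: it runs the statements through Python's built-in sqlite3 module, on the assumption that these simple WHERE, ORDER BY, GROUP BY, and JOIN forms read the same way in HiveQL. The tables, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a Hive warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
CREATE TABLE employees (emp_id INTEGER, name TEXT, salary INTEGER, dept_id INTEGER);
INSERT INTO departments VALUES (1, 'Sales'), (2, 'Engineering');
INSERT INTO employees VALUES
  (1, 'Asha', 45000, 1),
  (2, 'Ben', 30000, 1),
  (3, 'Chen', 60000, 2);
""")

# SELECT ... WHERE: keep only rows that satisfy a condition.
where_rows = conn.execute(
    "SELECT name FROM employees WHERE salary > 40000"
).fetchall()

# SELECT ... ORDER BY: sort the result set.
ordered_rows = conn.execute(
    "SELECT name, salary FROM employees ORDER BY salary DESC"
).fetchall()

# SELECT ... GROUP BY: one output row per group, with an aggregate.
grouped_rows = conn.execute(
    "SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id ORDER BY dept_id"
).fetchall()

# SELECT ... JOIN: combine rows from two tables on a shared key.
joined_rows = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()
```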
This video introduces the basic concepts of Hive:
- Data types
- Create the database
- Drop the database
- Create the table
- Alter table
- Drop table
- Built-in functions and views
This video is a brief tutorial on how to make use of Sqoop in the Hadoop Ecosystem. It covers Sqoop - Introduction, Installation, Import, Import-All-Tables, Export, Sqoop Job, Codegen, Eval, List Database and List Table.
This video explains how to configure Sqoop for Oracle database in DEI.
This video provides an introduction to Amazon S3 FileName Port Feature in DEI 10.4. The video also discusses the FileName Port feature's implementation in DEI 10.4 with a quick demo.
This video provides a brief introduction to Data Preview and then discusses the workings of Data Preview on Spark, Data Preview Logs, its rules and guidelines, and more.
Introduction to JDBC V2 Connector in DEI 10.4
This video covers the basics of JDBC V2 Connectors, their supported features, the difference between Sqoop and JDBC V2 connectors, and future enhancements.
Introduction to JDBC V2 Connection on Databricks
This video discusses JDBC V2 connections and their properties. With the help of a demo, you will discover how to create a JDBC V2 connection and run a mapping on the Databricks cluster.
This video is a brief tutorial on how to apply Sqoop in the Hadoop Ecosystem. It covers Sqoop - Introduction, Installation, Import, Import-All-Tables, Export, Sqoop Job, Codegen, Eval, List Database, and List Table.
This video explains how to configure a Sqoop mapping to honor the Owner Name Field for IBM DB2, IBM DB2 z/OS, Oracle, Teradata, and Microsoft SQL Server Databases in Data Engineering Integration.
This video provides a brief introduction to Sqoop boundary queries, and it explains the difference between split by and Sqoop boundary query.
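The relationship between the two options can be sketched in a few lines. The split-by column decides which column the import is partitioned on, while the boundary query supplies the minimum and maximum values of that column that are then divided among the mappers. The function below is a simplified illustration for integer keys only; it is not Sqoop's actual splitter, and the name split_ranges is hypothetical.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer range [lo, hi] into contiguous ranges,
    one per mapper (fewer if the range has fewer values than mappers)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1
    n = min(num_mappers, total)
    base, extra = divmod(total, n)
    ranges = []
    start = lo
    for i in range(n):
        # The first `extra` ranges take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Boundaries as a default MIN/MAX query on the split-by column might return them.
default_splits = split_ranges(1, 100, 4)

# Boundaries as a custom boundary query might return them, for example when
# the import is restricted to a known key range.
custom_splits = split_ranges(1000, 1009, 4)
```

A custom boundary query is typically worth considering when computing the minimum and maximum over the whole table is expensive, or when the useful key range is already known.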
Introduction: Filename Port in Complex Files
This video provides an introduction to complex files. It also explains Filename Port on Source and Target Complex file data objects.
How to Use Wildcard Characters for Reading Data from Complex Files
This video discusses Wildcard Characters for reading data from complex files, a list of supported wildcard characters, followed by a quick demo.
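As a rough illustration of wildcard-based file selection, the sketch below uses Python's fnmatch module. It assumes the two most common wildcard characters, * (any run of characters) and ? (exactly one character); the list of characters that DEI actually supports is covered in the video. The file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names in a source directory.
files = [
    "sales_2021.avro",
    "sales_2022.avro",
    "sales_2022.json",
    "stock_2022.avro",
]


def select(pattern, names):
    """Return the names that match the wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# '*' matches any run of characters: every Avro file.
all_avro = select("*.avro", files)

# '?' matches exactly one character: sales files for years 2020 to 2029.
sales_202x = select("sales_202?.avro", files)
```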
This video explains how to configure the Cluster Configuration Object (CCO) in the Informatica DEI. It covers steps to export CCO from the admin console, supported by a demo.
This video, supported by a demo, guides you through the steps required to enable high precision for pushdown mappings.
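The following sketch shows why high precision can matter for decimal data. It rests on two assumptions that should be checked against the product documentation for your version: that without high precision decimal values are processed as double-precision floating point, and that high precision keeps up to 38 decimal digits. Python's float and decimal types stand in for the two modes.

```python
from decimal import Decimal, getcontext

# Assumed precision limit for high-precision mode (not taken from the source).
getcontext().prec = 38

value = "12345678901234567890.12345"  # 25 significant digits

as_double = float(value)     # stand-in for processing without high precision
as_decimal = Decimal(value)  # stand-in for processing with high precision

# A double holds roughly 15 to 17 significant digits, so the exact value
# stored in the float differs from the original decimal value.
precision_lost = Decimal(as_double) != as_decimal

# Adding a small amount is silently dropped by the double ...
double_unchanged = (as_double + 0.00001) == as_double

# ... but preserved by the decimal type.
decimal_sum = as_decimal + Decimal("0.00001")
```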
Introduction: Python Transformation on Databricks
This video discusses Azure Databricks, provides an overview of the Python transformation, and explains active and passive Python transformations.
Python Transformation in Streaming Mappings
This document explains how, effective in version 10.5, you can add a Python transformation to streaming mappings in the Databricks environment on the AWS or Azure platforms.
This video provides a brief introduction to the Ephemeral cluster, guides you through the steps to persist the application log, helps you collect logs from the Azure portal, and helps you get the Spark application log.
This video provides an introduction to Ephemeral Cluster and guides you to create a Hive connection and steps to use the Hive connection for a Hive Target.
In version 10.5, you can use Kudu as a target in streaming mappings. Kudu is a columnar storage manager developed for the Apache Hadoop platform.
Refer to these documents for more information:
Watch this video to understand the steps for creating a Kudu connection for DEI jobs.
Here's a video that explains the steps for generating Kudu tables via DEI jobs.
Watch this video to learn how to enable verbose logging for Sqoop objects in DEI mappings.
Effective in version 10.5.2, you can configure mappings to write to unmanaged Databricks table targets.
Mappings can access unmanaged Databricks tables built on top of any of the following storage types:
- Azure blob storage
- Azure Data Lake Storage (ADLS) Gen1 or Gen2
- Amazon Web Services (AWS) S3
Click on this link to learn more.
The Delta Lake Schema Evolution feature has been added in DEI 10.5.2.
Schema enforcement and evolution enable you to manage changes in a Databricks table schema. You can choose different strategies to manage schema changes.
Schema enforcement monitors changes to Databricks table schemas and rejects changes that do not match the target table schema.
When Databricks rejects changes, it cancels the write transaction and logs an exception. Schema enforcement is also known as schema validation. If you determine that you want to incorporate new columns in the target, schema evolution enables you to add them to the target in a controlled fashion.
To use schema evolution, you need to disable schema enforcement in the target Databricks workspace.
Click on this link to learn more.
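The difference between the two strategies can be modeled in plain Python. This is a toy model under simplifying assumptions (a schema is just a list of column names, and only added columns are considered); it is not how Delta Lake or DEI implements either feature, and the names write_rows and SchemaMismatchError are hypothetical.

```python
class SchemaMismatchError(Exception):
    """Raised when incoming rows do not match the target schema."""


def write_rows(table_schema, rows, evolve=False):
    """Return (schema_after_write, rows_written).

    With evolve=False (enforcement), rows with unknown columns cancel the
    whole write. With evolve=True (evolution), unknown columns are appended
    to the schema and the write proceeds.
    """
    schema = list(table_schema)
    new_columns = []
    for row in rows:
        for column in row:
            if column not in schema and column not in new_columns:
                new_columns.append(column)
    if new_columns:
        if not evolve:
            raise SchemaMismatchError(f"unexpected columns: {new_columns}")
        schema.extend(new_columns)
    # Columns missing from a row are written as None.
    written = [{column: row.get(column) for column in schema} for row in rows]
    return schema, written


target_schema = ["id", "name"]
incoming = [{"id": 1, "name": "a", "email": "a@example.com"}]

# Enforcement: the extra column causes the write to be rejected.
try:
    write_rows(target_schema, incoming)
    rejected = False
except SchemaMismatchError:
    rejected = True

# Evolution: the extra column is added to the target schema.
evolved_schema, written_rows = write_rows(target_schema, incoming, evolve=True)
```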
Effective in version 10.5, you can use Google PubSub as a source in streaming mappings. This document explains how to use the Google PubSub source to read messages from the configured Google Cloud PubSub subscription.
Go through the document to learn about TDCH integration with Data Engineering Integration (Big Data Management). On completion, you will understand how to configure and use TDCH with Data Engineering Integration (Big Data Management).
An intelligent structure model can be used in a Confluent Kafka, Kafka, Azure Event Hubs, and Amazon Kinesis Stream data object for streaming mappings that run on the Databricks Spark engine.
Refer to this link to learn more.
This tutorial explores the principles, installation, and operations of Kafka and walks you through the deployment of a Kafka cluster. The video concludes with real-time applications and integration with big data technologies.
This tutorial has been prepared for professionals aspiring to make a career in big data analytics using the Apache Kafka messaging system.
Prerequisite: You must have a good understanding of Java, Scala, distributed messaging systems, and the Linux environment.
Effective in version 10.5.1, a few new transformations can be used in the streaming mappings that run in the Databricks environment on Azure and AWS platforms:
This video will help you understand Informatica Data Engineering Streaming (Big Data Streaming) to prepare and process streams of data in real-time and uncover insights in time to meet your business needs.
It also gives a brief introduction to Kafka architecture, license options for DES, and prerequisites for running DES mappings, followed by a demo.
This video showcases a DEI 10.2 Spark demo.
Install DEI using Docker
Refer to the following links:
Installing DEI (10.4.0 – 10.4.1) on Docker using Container Utility
Installing DEI (10.4.x - 10.5) on Docker using Informatica Deployment Manager
Configure YARN in DEI
Learn how to configure YARN in DEI. This article shows you how to manage the allocation of resources to the jobs that run in the Hadoop environment.
You can manage resources using YARN schedulers, YARN queues, and node labels. This article describes how to define and use schedulers, queues, and node labels.
Click here to read the article
Metadata Access Service Reference
DEI makes use of various application services to access and integrate data from the Hadoop environment at design and run time.
This article explains Metadata Access Service used to access and preview metadata from the Hadoop environment at design time.
Click here to view the article
Configure DEI for SSL Enabled Cluster
This article describes the architecture and advantages of migrating your big data solution to Amazon AWS, how to implement the one-click DEI deployment, and how to implement ephemeral clusters.
DEI for Administrators
This course enables administrators to set up a live DEI environment by performing administrative tasks such as installing and configuring the distribution package, setting up security mechanisms, and tuning performance. It trains administrators to set up and manage the security mechanisms.
The course also shows you how to integrate the Informatica domain with the Hadoop ecosystem, leveraging Hadoop's processing capability to handle huge data sets.
Register for this OnDemand Training for DEI for Administrators
Big Data for Developers
Discover how to leverage Informatica Big Data Management for the optimization of data warehousing by offloading data processing to Hadoop.
Register for this OnDemand Training for Big Data for Developers
Informatica Developer Tool for Big Data Developers
Learn the mechanics of Data Integration using the Informatica Developer tool for Big Data Development. This course takes you through the key components to develop, configure, and deploy data integration mappings.
Register for this OnDemand Training for Informatica Developer tool for Big Data Developers
Data Engineering Streaming (DES) for Developers
Gain the skills necessary to execute end-to-end big data streaming use cases. Learn to prepare, process, enrich, and maintain streams of data in real-time using Informatica, Edge, Kafka, and Spark.
This video discusses how to change the name of an existing domain. It includes taking a backup of the existing nodemeta.xml, running infasetup updateDomainName, running the updategatewaynode command, and performing a Tomcat cleanup.
This video shows you how to change the name of an existing node in PowerCenter 9.6.1 and PowerCenter 10.2. It discusses how to take a backup of an existing node, create a new node from the Informatica Administrator Console, run the definegatewaynode command to change to the new node name, perform a Tomcat cleanup, and restart services.
This video is a demo about how to configure Lightweight Directory Access Protocol (LDAP) authentication in an Informatica domain. It discusses supported LDAP servers, connecting using LDAP client/LDAP data overview, configuring the Informatica domain with LDAP and other useful information.
This video demonstrates the process of domain metadata export and import. The video also explains Infasetup, Backup Domain, Restore Domain, and FAQs, followed by a demo.
The video demonstrates how easy it is to take existing PowerCenter applications and mappings and redeploy them into the big data world. The video discusses what the PowerCenter Reuse report identifies, followed by a quick demo.
Here is an introduction to Cluster Configuration Objects (CCO).
The video elaborates on using DEI with Hadoop Cluster. Use cluster configuration privileges and roles to determine the actions that users can perform in the Administrator tool and the infacmd command line program.
In this video, you will learn how you can view, edit, and assign permissions on a cluster configuration.
This video will help you understand how to integrate DEI with Hadoop Cluster.
In this video, you will learn more about DEI performance tuning, including how to create cluster configuration objects.
Understand security for both authentication and authorization, including Kerberos authenticated environment, permissions, and privileges on Ranger, and mapping using the User Impersonation method of authorization.
This video discusses impersonation and OS Profile in DEI. This includes understanding what impersonation is, the need to impersonate, how to configure impersonation, OS profile in DEI, configuring OS profile, and some common issues.
This video takes you through the Kerberos configuration for DEI. It helps you understand and handle issues related to Kerberos configuration for DEI, gives you a high-level understanding of the capabilities of Kerberos and how Informatica uses it, and shows you how to troubleshoot and narrow down whether an issue is caused by the customer configuration or by the product.
You can use a Kerberos-enabled MapR cluster (version 5.1) with DEI to run mappings using the Native, Hive and Blaze execution engines. This video shows you how to configure DEI to communicate with a Kerberos-enabled MapR cluster.
You can configure Hive to use LDAP authentication on Cloudera CDH and Hortonworks HDP clusters. This article discusses how DEI integrates with the authentication mechanisms of the Hadoop cluster and Hive.
Two-factor authentication, utilizing smart cards or USB tokens, is a popular network security mechanism. This article explains how two-factor authentication works in an Informatica domain configured to use Kerberos authentication. The information in the article might also be useful when troubleshooting authentication issues.
Create a data flow design in Informatica Administrator by adding source services and target services to a data flow. You can connect the two services and add any transformations that you want to apply.
Monitor the EDS data flow to check if the data (in bytes) is being sent and received by the sources and targets created.
Convert unstructured data to structured data using EDS transformations such as an Unstructured Parser. You can then ingest this clean data to BDS for further processing.
Create a Kafka data object to read from a Kafka broker. Use the HDFS data object as a target and run the mapping in Spark mode.
Create a Kafka data object to read from a Kafka broker. Use the Kafka data object as a target and run the mapping in Spark mode.
Access the Hadoop cluster and ensure that the streaming data is being displayed for your analysis.
Ingest objects in bulk without any manual effort by using mass ingestion.
Merge records into a pre-created Hive table by creating a mapping that uses an Update Strategy transformation.
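The merge logic behind an Update Strategy transformation can be sketched in plain Python. The sketch assumes a single key column named id and models the target table as a dictionary keyed on it; each incoming row is flagged for insert or update, and the flag decides how it is applied. The flag names mirror the Update Strategy constants, while the functions flag_row and merge and the sample rows are hypothetical.

```python
# Row flags, named after the Update Strategy constants.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3


def flag_row(row, target):
    """Flag a row for update if its key already exists, else for insert."""
    return DD_UPDATE if row["id"] in target else DD_INSERT


def merge(target, incoming):
    """Return a new target with the incoming rows inserted or updated."""
    merged = dict(target)  # leave the original target untouched
    for row in incoming:
        flag = flag_row(row, merged)
        if flag in (DD_INSERT, DD_UPDATE):
            merged[row["id"]] = row
    return merged


existing = {1: {"id": 1, "name": "old"}}
changes = [{"id": 1, "name": "new"}, {"id": 2, "name": "added"}]
result = merge(existing, changes)
```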
You can also use aggregate functions as window functions in an Expression transformation.
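To make the idea concrete, the sketch below computes a running sum per partition in plain Python. Unlike a regular aggregation, a windowed aggregate returns one output row per input row. The partition key, order key, frame (start of partition up to the current row), and sample data are all assumptions chosen for illustration; the function running_sum is hypothetical and is not part of DEI.

```python
from collections import defaultdict


def running_sum(rows, partition_key, order_key, value_key):
    """Add a 'running_sum' field: the sum of value_key from the start of
    each partition up to the current row, in order_key order."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    output = []
    for key in sorted(partitions):
        total = 0
        for row in sorted(partitions[key], key=lambda r: r[order_key]):
            total += row[value_key]
            output.append({**row, "running_sum": total})
    return output


transactions = [
    {"acct": "A", "day": 1, "amt": 10},
    {"acct": "B", "day": 1, "amt": 5},
    {"acct": "A", "day": 2, "amt": 7},
]
windowed = running_sum(transactions, "acct", "day", "amt")
```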
Process Avro files and store them in different targets.
Read a Sqoop data object and use an Aggregator transformation to create an Array port. You will then write the data to an Avro file.
Propagate data type precision from source to target to improve performance.
Data Engineering Management on Azure Marketplace
Watch this video to learn how Informatica Data Engineering Management on Azure Marketplace helps to solve the “Data Lake on cloud” use case in the Azure ecosystem followed by a demo.
Webinar: Data Engineering Management on Azure Marketplace
Data Engineering Management on Amazon Cloud
Watch this video to learn how Informatica Big Data Management helps to solve the “Data Lake on cloud” use case in the Amazon ecosystem, followed by a demo.
Webinar: Data Engineering Management on Amazon Cloud
Meet the Experts: Deep Dive and Demo: BDM and BDS for 10.2.2 Release
Watch a deep dive of the new AI-driven big data management and streaming innovations: automatic schema drift handling, so you can adapt to frequent changes in batch and streaming data without impacting processing; advanced Spark support including Python Tx and Spark Structured Streaming; and predictive insights into big data jobs for capacity planning on hybrid, cloud, and on-premises environments.
Webinar: Meet the Experts: Deep Dive and Demo: BDM and BDS for 10.2.2 Release
Informatica Data Engineering Management and CLAIRE
Watch this demo to see how the Informatica CLAIRE™ Engine uses AI and ML to accelerate all stages of intelligent data lake management.
Webinar: Informatica Data Engineering Management and CLAIRE
Mass Ingestion into Amazon S3 using Data Engineering Management
Video demonstration of Informatica’s Mass Ingestion service to load relational data into Amazon S3.
Webinar: Mass Ingestion into Amazon S3 using Data Engineering Management
Meet the Experts: CI/CD and DevOps for Big Data Management
Learn more about what's new in Data Engineering Integration, including leveraging version control systems like Git, invoking Informatica BDM processes from open source technologies like Jenkins, and using concurrency, stability, and other operationalization enhancements.
Webinar: Meet the Experts: CI/CD and DevOps for Big Data Management
Here is a set of guides and articles that explain how to set up Informatica DEI 10.5.0. Refer to the documents based on the respective versions.
This video discusses how to upgrade to DEI 10.2.2.
This video explains how to upgrade Informatica DEI from 10.1.1 to 10.4. (Part 1)
This video explains how to upgrade Informatica DEI from 10.1.1 to 10.4. (Part 2)
This video helps you understand Hadoop commands from DIS.
Here are the Troubleshooting guides:
Data Engineering Streaming Troubleshooting Guide: This guide will help to debug Data Engineering Streaming job issues.
Spark Troubleshooting Guide: This guide will help to debug Spark mapping job failures and tune the performance of mappings.
Databricks Troubleshooting Guide: This guide will help to debug Databricks-related issues.