Informatica Data Engineering Integration (Big Data Management) delivers high-throughput data ingestion and data integration processing so business analysts can get the data they need quickly. Hundreds of prebuilt, high-performance connectors, data integration transformations, and parsers enable virtually any type of data to be quickly ingested and processed on big data infrastructure, such as Apache Hadoop, NoSQL, and MPP appliances. Beyond prebuilt components and automation, Informatica provides dynamic mappings, dynamic schema support, and parameterization for programmatic and templatized automation of data integration processes.
The Advanced Level will help you develop expertise in DEI. It consists of videos, documents, and articles that take you through performance tuning, monitoring, and troubleshooting for Sqoop, Blaze, Spark, the Data Integration Service, and more.
After completing this level, you will earn an Informatica Badge for Data Engineering Integration (Big Data Management). So continue with your product learning and earn your badge!
In this module, you learned about Sqoop performance tuning, Spark performance tuning, Data Integration Service tuning, and Model Repository Service tuning. You also gained a better understanding of monitoring the Blaze engine and its tab settings, configuring the monitoring settings so that the Model Repository Service stores historical data, using complex data types on Spark (arrays, structs, and nested ports), ML-based parsing with Intelligent Structure Discovery, and machine learning with the Python transformation in DEI.
You have also learned about the REST Operations Hub in DEI and executing REST queries to gather monitoring statistics, the Spark history server, stateful computing in Spark, deployment automation, creating and deploying an application, mass ingestion into Amazon S3 using Data Engineering Integration, and troubleshooting the Analyst Service, the Search Service, DIS on a grid, the Monitoring tool, Spark, and Blaze.
You have successfully completed all three levels of DEI onboarding and have earned your badge!
In this video, you will learn how to tune Sqoop for a sample Informatica mapping, starting by setting the mapper count, followed by a demo.
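As a concrete illustration (not taken from the video), the mapper count for a Sqoop import is controlled with the standard `--num-mappers` (or `-m`) and `--split-by` arguments; the connection details and table names below are placeholders only:

```shell
# Placeholder JDBC URL, credentials, and table -- substitute your own.
# --num-mappers controls the degree of parallelism; --split-by names the
# column Sqoop uses to partition the source rows across those mappers.
sqoop import \
  --connect "jdbc:oracle:thin:@//dbhost:1521/ORCL" \
  --username etl_user \
  --password-file /user/etl/.pw \
  --table SALES.ORDERS \
  --split-by ORDER_ID \
  --num-mappers 8 \
  --target-dir /data/raw/orders
```

Too many mappers can overload the source database, so the mapper count is usually tuned against what the database and network can sustain.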
This article provides sizing recommendations for the Hadoop cluster and the Informatica domain, tuning recommendations for various DEI components, best practices to design efficient mappings, and troubleshooting tips.
This article is intended for DEI users, such as Hadoop administrators, Informatica administrators, and Informatica developers.
This video will help you understand how to improve the performance of the Spark engine. You will also learn how to set the client-side configuration properties of the Spark engine that enable dynamic memory allocation, followed by a demo.
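For reference, dynamic executor allocation in Spark is governed by a handful of standard Spark properties; a typical combination looks like the fragment below. The values are illustrative, and in DEI such properties are usually supplied as Spark advanced properties on the Hadoop connection rather than in a standalone file:

```properties
# Enable dynamic allocation (requires the external shuffle service).
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
# Bounds on the executor pool -- tune to your cluster's capacity.
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=50
# How long an executor may sit idle before being released.
spark.dynamicAllocation.executorIdleTimeout=60s
```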
Because the Data Integration Service is responsible for the main execution of jobs, it is important to tune parameters such as the maximum heap size for optimal performance. In this video, you will learn how to tune different properties related to the performance of the Data Integration Service, followed by a demo.
In this video, you will learn how to access and edit the heap size properties of the Model Repository Service. The maximum heap size can be configured to increase the performance of these services.
In this video, you will learn how to select Blaze as the run-time engine for a mapping, followed by a demo.
Informatica Administrator has a monitoring capability that enables you to view statistics and log events for mappings that run in the Hadoop environment. In this video, you will see how you can use the monitoring feature to view statistics of various jobs that you run in the environment, followed by a demo.
In this video, you will learn how you can configure the monitoring settings to enable Model Repository Service to store historical data, followed by a demo. After you configure monitoring settings, the Model Repository stores runtime statistics about the objects that are run in the Data Integration Service.
Learn how to work with complex types (arrays) in DEI 10.2.1 with the help of a demo.
Learn how to work with complex types (structs) in DEI 10.2.1 with the help of a demo.
Learn how to use Intelligent Structure Discovery to leverage machine learning for processing complex file formats and use the models in Big Data Management 10.2.1 for operationalization.
Learn how to work with nested complex types, such as an array of structs, in DEI 10.2.1.
This video showcases a demo of how to leverage machine learning with the Python transformation in DEI. The Python transformation is supported in DEI in Spark execution mode starting with 10.2.1 (Spring 2018 release).
REST queries can be used to gather monitoring statistics, advanced statistics, and execution steps. Learn how to execute REST queries to gather monitoring statistics.
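A minimal sketch of such a query with curl is shown below; the host, port, endpoint path, credentials, and parameters are all placeholders, so consult the REST Operations Hub documentation for the exact resource names in your version:

```shell
# Every angle-bracket value is a placeholder, not a real endpoint.
curl -k -u "<user>:<password>" \
  -H "Accept: application/json" \
  "https://<gateway-host>:<port>/<rest-operations-hub-resource>?jobId=<job-id>"
```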
This video introduces you to the Spark history server, its benefits, the steps to keep it always running, and common issues and their resolutions.
Stateful computing brings the ability to perform cross-row operations (such as comparing the current row with the previous row) using Spark's window functions.
In this video, learn how to perform stateful computing using Spark in DEI 10.2.
In this video, you will learn how to integrate DEI deployments with version control systems such as Git/Bitbucket, ticketing systems such as JIRA, and orchestration systems such as Jenkins.
Learn how to build completely automated CI/CD pipelines and improve your overall DevOps lifecycle.
Create and deploy an application that contains mappings, workflows, and other application objects to make the objects accessible to users who want to leverage the data outside of the Developer tool. You can deploy the application to a Data Integration Service to run the objects, or to an application archive file to save a copy of the application and deploy it to a Data Integration Service later.
If you make changes to an application object in the Developer tool, you can update the deployed application. This article describes how to create, deploy, and update an application.
Git Version Control
You can integrate the Model Repository with the Perforce, Subversion, or Git version control systems. This article discusses how to integrate Git as a version control system for the Model repository in version 10.2 HotFix 1.
This video demonstrates using the Mass Ingestion service to ingest relational data into Amazon S3. The Mass Ingestion service allows users to ingest thousands of relational tables into HDFS, Hive, Amazon S3, Azure Blob, and Azure ADLS without writing a single mapping.
The mass ingestion specification can be designed, developed, deployed, executed, and monitored from a single web-based interface, without ever opening Informatica Developer.
This video discusses Analyst Service startup issues and how to troubleshoot them. This includes an Analyst tool overview, finding the logs, an overview of common startup issues, and troubleshooting guidelines.
This video discusses search service basics and connectivity, search index creation, search results and performance, common issues, and troubleshooting tips.
This video discusses the steps to identify job logs for a job run by the Data Integration Service on a grid in Informatica Data Quality 10.x. It also walks through the steps to collect logs for Data Quality jobs such as mappings and workflows when Data Integration Services run on a grid, the options available to fetch job logs from the available Informatica Data Quality (IDQ) clients, and how to identify the node on which a job was dispatched.
This video discusses the monitoring tool and hang scenarios. This includes the Monitoring Service architecture, troubleshooting monitoring tool hangs, known issues, and recommended configurations.
This video will take you through the Spark architecture, Spark integration with BDM, Spark shuffle, Spark dynamic allocation, the journey from Hive and Blaze to Spark, Spark troubleshooting and self-service, Spark monitoring, and other references.
This video provides an overview of the Blaze architecture and components. It also discusses Blaze configuration, log locations and collection, common issues, and troubleshooting tips.
This webinar describes the ephemeral cluster task available in Informatica's DEI product. It gives an overview of this feature and how it works with cloud ecosystems such as Azure and AWS.
Informatica is constantly innovating tools and systems to help you leverage the full capabilities of our products. This section provides a list of all such tools that support DEI products.
- InfaDump: A shell script that collects jstack, pmstack, and heap dumps from a running process. It can collect dumps at regular intervals for 'n' iterations. Click here to view the document.
- pmstack: A tool that captures native thread stacks from a DTM or Java process, or any other process running on a Linux/AIX/Solaris server. Click here to view the document.
- PlatformLogCollector: A Java-based tool that collects logs from the Informatica server machine. Click here to view the document.
- InfaLogs: A tool that collects application service logs written to the file system, Log Service agent logs, and workflow and session logs. Click here to view the document.
- ssgodbc: An interactive query tool for testing ODBC connections. You can enter a SQL statement, such as a SELECT query, and view the results, or execute data definition language (DDL) statements to create tables and other objects. Click here to view the document.
- IA4J: On a tightly time-bound production server, it is inconvenient to get downtime to debug an issue. The IA4J tool gets you what you need with zero downtime. Click here to view the document.
- ysplit: This tool helps you split YARN logs into multiple files (based on container and log type) and builds an HTML index to quickly navigate through these files. Click here to view the document.
- KUtil: A program used to run other programs in a Kerberos context. Note that KUtil honors the configuration properties loaded into it and will not run the child utility/program in a Kerberos context if the property hadoop.security.authentication != kerberos. Click here to view the document.
- sysmon (System Health Tracker): Linux servers provide various system tools to capture diagnostic information. sysmon is a wrapper that invokes the needed commands to collect diagnostic information from a Linux server. Click here to view the document.
- InfaCoreFileDepPackager: This tool helps collect the dependent libraries associated with a core dump generated by an Informatica process. Click here to view the document.