Lake system scanners such as Amazon S3, ADLS Gen2, and Google Cloud Storage support profiling of the following file types:
- Avro
- Parquet
- CSV
Data profiling in CDGC involves evaluating the quality of the metadata extracted from the respective source system. Running a profile on Avro or Parquet files in CDGC requires an advanced cluster configuration because of the complexity of these file formats. Both formats are optimized for big data processing: Avro is row-based and Parquet is columnar.
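The difference between a row-based and a columnar layout can be illustrated with a minimal sketch in plain Python. It does not read real Avro or Parquet files and does not reflect how CDGC profiles data internally. The sample records, the function names, and the chosen statistics are all assumptions made for illustration, and the sketch assumes every record carries the same fields.

```python
# Hypothetical sample records; not taken from any real source system.
rows = [
    {"id": 1, "country": "US", "amount": 10.5},
    {"id": 2, "country": None, "amount": 7.0},
    {"id": 3, "country": "US", "amount": None},
]


def to_columns(rows):
    """Pivot row-oriented records into a column-oriented dict of lists.

    Assumes every record has the same set of fields.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def profile_column(values):
    """Compute a few simple profile statistics for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def profile_from_rows(rows, name):
    # Row-oriented access: every record is visited to pull out one field.
    return profile_column([row.get(name) for row in rows])


def profile_from_columns(columns, name):
    # Column-oriented access: only the requested column is touched.
    return profile_column(columns[name])
```

Both access paths produce the same statistics for a given column. What differs is how much data has to be visited to get there, which is one reason the storage layout matters when profiling large files.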
An advanced cluster is a Kubernetes cluster that provides the distributed processing capabilities needed to profile complex file types (Avro and Parquet) in CDGC. It manages large-scale data processing efficiently by distributing tasks across multiple nodes, which is essential for handling the complexity and size of Avro and Parquet files.
- Basic knowledge of catalog source configuration in MCC and its capabilities (for example, Metadata Extraction and Data Profiling).
- Access to create a catalog source in MCC and to view the results in DGC (Data Governance and Catalog).
- Knowledge of the Linux environment and of configuring a Secure Agent.
- Only a Linux Secure Agent is supported for configuring advanced clusters.
- To use an advanced cluster, the Secure Agent group must contain a single Secure Agent.
- A service named "Elastic Server" is also required, along with the other services needed for CDGC profiling.
- Select the "Elastic Runtime Environment" and "Staging Connection" under the data profiling section. This applies to data profiling on complex file types (Avro and Parquet).
- Learn how to set up an advanced cluster to profile Avro and Parquet files in CDGC based on business requirements.
- Learn how to configure data profiling for complex file types.
- Gain a comprehensive understanding of setting up an advanced cluster and using it to profile complex file types such as Avro and Parquet in CDGC.
- Administrator
- Governance User
- Governance Administrator
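The advanced-cluster requirements listed earlier (a Linux Secure Agent, a single-agent group, the Elastic Server service, and the Elastic Runtime Environment and Staging Connection selections) can be captured as a pre-flight checklist. The following is a rough sketch only. `ProfilingSetup` and `check_prerequisites` are hypothetical names invented for this example, not part of any Informatica API, and the values would have to be gathered manually from the environment.

```python
from dataclasses import dataclass, field


@dataclass
class ProfilingSetup:
    """Hypothetical description of a setup; not an Informatica API object."""
    agent_os: str
    agents_in_group: int
    services: set = field(default_factory=set)
    elastic_runtime_environment: str = ""
    staging_connection: str = ""


def check_prerequisites(setup):
    """Return a list of unmet prerequisites (empty when all are met)."""
    problems = []
    if setup.agent_os.lower() != "linux":
        problems.append("Secure Agent must run on Linux")
    if setup.agents_in_group != 1:
        problems.append("Secure Agent group must contain exactly one Secure Agent")
    if "Elastic Server" not in setup.services:
        problems.append("Elastic Server service is not available")
    if not setup.elastic_runtime_environment:
        problems.append("Elastic Runtime Environment is not selected")
    if not setup.staging_connection:
        problems.append("Staging Connection is not selected")
    return problems
```

An empty list from `check_prerequisites` means every item on the checklist is satisfied. Otherwise, each entry names one requirement to revisit before profiling Avro or Parquet files.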
Ask An Expert
Feature Clarity
Cloud Data Governance and Catalog
Ask An Expert
Configure
Implement
Adoption - Technical
Functional
AAE-CDGC-029
Disclaimer
- All the topics covered in the Success Accelerators/Ask An Expert sessions are intended for guidance and advisory only. This is implicit and it will not be called out under the scope of each engagement.
- Customers need to include their relevant technical/business team members highlighted in each engagement topic to derive the best out of each engagement.
- Customers need to perform any hands-on work themselves, leveraging the guidance from these engagements.
- Customers need to work with Informatica Global Customer Support for any product bugs/issues and troubleshooting.