Data Transformation is an application that processes complex files, such as messaging formats, HTML pages, and PDF documents. Data Transformation also transforms formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT. It is a comprehensive tool for transforming data from a variety of hierarchical, semi-structured and unstructured formats to virtually any other format.
Required components that need to be built as part of a Data Transformation service include Streamers, Parsers, Mappers, and Serializers. To create any-to-any transformations, components may (and should) be chained. A Parser normally covers any format to XML, a Mapper covers one representation of XML to another, and a Serializer covers XML to any end format. Streamers are normally used on the front end to break down large files, although they can be used on the back end as well. Streamers are available for textual or XML structures. The number of components needed to properly build the DT service depends upon the source data and its complexity.
Data Transformation installs by default when you install Informatica Developer (the Developer tool).
The Data Processor transformation has multiple views that you access when you configure the transformation and run it in the Developer tool. Some of the Data Processor transformation views do not appear in the Developer tool by default. To change the views for the transformation, click Window > Show View > Other > Informatica.
The Data Processor transformation has the following views:
When building Data Transformation services, the following questions should be answered to determine not just what components are required, but how many:
A wizard can be used to create an auto-generated Data Processor transformation with input and output formats such as COBOL, XML, ASN.1, relational, or JSON. The wizard can also be used to transform user-defined formats.
The wizard creates a transformation with relevant Script, XMap, or Library objects that serve as templates to transform the input format to the output format. The wizard builds the transformation solution according to the formats selected and the specification file, example file, or copybook provided. The transformation might not be complete, but it contains components that you connect and customize to complete the transformation definition.
Processing semi-structured or flat files presents a challenge: the variability of the data must be handled while the data is also put into a format that can be easily processed (such as an XML structure). B2B Data Transformation (DT) accomplishes this by implementing a Parser object.
A Parser defines a process of transforming any source format into XML.
When a file structure is received without a pre-constructed parsed target structure (as exists for EDI, SWIFT, HIPAA, etc.), the input data must be evaluated: either decipher what each field is (by using the header row in the file) or work with the customer to create a record layout.
When creating the record layout many variables can affect the overall design. The developer should spend adequate time with someone who knows the data to obtain answers to the list of questions below. Taking the proper time to analyze the source data before creation of the XSD structure will significantly decrease rework in the development phase of a project.
When creating the XSD for a given DT service, always try to create global structures that can be reused in other areas. Only create global structures if elements or groups of elements are widely used in the XSD; if an element or group is used only once, make it local. This not only helps to reduce file size and keep down clutter in the XSD, but also helps with maintenance for future growth.
Below is an example of two global structures in an XSD and how they are implemented. Use the Annotations to describe what each element or group is and how they relate. For example, if an XSD has many global structures, identify the Root element with a comment in the Annotation.
Global Structures
The GS1 element below ‘references’ the global GS1 structure. The referencing element does not duplicate the elements defined under the global structure’s sequence; it simply points to them.
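A minimal sketch of what such an XSD might look like follows; the GS1 child elements, the Interchange root element, and the annotation text are illustrative assumptions rather than the original example:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Global (reusable) structure, annotated so its purpose is clear -->
  <xs:element name="GS1">
    <xs:annotation>
      <xs:documentation>Global GS1 group header; reused wherever a functional group appears.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <!-- Child element names are illustrative assumptions -->
        <xs:element name="FunctionalIDCode" type="xs:string"/>
        <xs:element name="SenderCode" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Second global structure: the Root element, identified in its Annotation -->
  <xs:element name="Interchange">
    <xs:annotation>
      <xs:documentation>Root element of the document.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <!-- References the global GS1 structure instead of redefining it -->
        <xs:element ref="GS1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>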
A data type can be declared for each element in an XSD. Elements marked with a primitive data type can be used by DT to help process the data before it is inserted into the element. By default, DT uses the XSD type to validate the data being inserted, and it will only insert data that matches the type declared for the element.
Example:
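A sketch of such a declaration (the xs: prefix and the surrounding schema element are assumed and omitted here):

<!-- 'Field1' accepts only integer data -->
<xs:element name="Field1" type="xs:integer"/>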
The XSD snippet above has one element, named ‘Field1’, which can only accept data of the integer type. When processing the source data, DT will produce results for the following:
Input Data Contains:
The above scenarios will run without error. The property “disable_XSD_type_search” must be checked for DT to throw an exception when processing the source if the data type of the source does not match what is in the XSD.
When creating a Parser, the organization of objects is just as important as the processing that they do. From a maintenance perspective, the main parser script should be used for logic implementation and all main marshalling to other objects. Just as was done with the creation of the XSD in the step above, grouping like segments of code together not only helps with maintenance, but also with readability.
Other scripts can be created, for example, to hold all of the variables used.
When processing data in a DT Parser other Parsers may need to be called to help with the complexity of the data being processed. DT provides two main objects to assist with this:
The RunParser is useful if a file path is being passed to load a file or the data in question is not in the current sequence of the data being processed.
The EmbeddedParser is useful when the data being processed is sequential. Unlike the RunParser, the EmbeddedParser does not break from the current position of the data being read; instead it allows a finer grain of control over how that data is processed. Use the EmbeddedParser if the data has any or all of the following attributes:
As shown in the example below, the offset count begins back at 0 inside the Parser that the EmbeddedParser calls. The result of the code below is:
Field1 = AAAAA
Field2 = 123
The rest of the string (BBBB) was dropped.
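Assuming, for illustration, that the source string was AAAAA123BBBB and that the called Parser maps the first five characters to Field1 and the next three to Field2, the output would look something like the following (the Output wrapper element name is an assumption):

<Output>
  <Field1>AAAAA</Field1>
  <Field2>123</Field2>
  <!-- the trailing 'BBBB' is not mapped to any element and is dropped -->
</Output>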
Most data being retrieved from an XML structure for a business unit tends to be much smaller than the entire payload. Business units tend to cherry-pick data out of sub-structures, which requires traversing deep into the structure for relatively small amounts of data. Data Transformation can process complex XML structures efficiently, allowing this type of cherry-picking to be accomplished easily.
The Data Transformation Mapper handles the processing of XML-to-XML structures. The Mapper loads the source XML into a Document Object Model (DOM), allowing it to quickly move forward and backward through the DOM.
Loading data into a DOM can be very costly for memory and CPU. If the source XML is 100MB, but the data required is small, it is worth modeling a target XSD that can hold the required data. Processing applications will see a performance boost.
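As a sketch, if a business unit only needs an account number and a status out of a large source document, the purpose-built target XSD could be as small as the following (the element names are illustrative assumptions):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="AccountSummary">
    <xs:complexType>
      <xs:sequence>
        <!-- Only the fields the business unit actually needs -->
        <xs:element name="AccountNumber" type="xs:string"/>
        <xs:element name="Status" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>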
Data Transformation has two functions for the processing of XML:
The RunMapper function takes an ‘input’ from either another Mapper or any other Data Transformation Component.
Use the RunMapper when processing a whole XML structure. The RunMapper calls a Mapper that has already been created, or a new Mapper can be created under the RunMapper. Only use the RunMapper if the XML structure has not already been loaded into a DOM by another RunMapper.
NOTE: If a RunMapper is used inside a Parser, the target XSD is automatically passed directly into the Mapper called from the RunMapper, which provides a performance boost. The ‘input’ attribute does not need to be used.
An EmbeddedMapper activates a secondary Mapper, allowing it to process the XML in the current scope without reloading the XML into the DOM. Output from the EmbeddedMapper is stored in the same output structure that the current Mapper has in scope.
The EmbeddedMapper also allows other variable data to be passed into the Mapper it calls by using schema_connections. If these fields are altered inside the secondary Mapper, the parent Mapper component will also reflect the updated values.
Source files (binary, text, XML) can sometimes be very large (e.g., more than 100 MB in size). In such cases, the files can be too large to hold in memory in a single read, or they can cause the entire DT service to wait extensively for the read to complete before processing. This can cause a serious degradation in performance, so an alternate solution is required to handle these large files. The Data Processor Streamer object can remove these bottlenecks.
DT Streamers can help with performance when there is a need to consume an exceptionally large file. The time and resources needed to load a large file into memory in one pass can be very costly. No matter what 'type' of file the source is, it must be loaded into memory before DT can begin processing. If some level of pre-processing is required, there is the initial time to load, plus the time to reload the file into memory so that the pre-processing can be accomplished.
Note: For the terminology used in this Best Practice, a “chunk” refers to a piece of data taken from the entire data source. In DT, all pieces of data are typically stored in memory. Chunks in memory are referred to as “buffers” but these two terms are synonymous. A “physical buffer” refers to the raw chunk (or chunks) of data drawn from the source. A “logical buffer” is the part of the physical buffer that holds a complete section relevant for processing.
If the data is “chunked”, then many very small chunks are handled in memory, and the strain on the overall system can be greatly reduced, resulting in significant performance gains.
The idea is to treat the source data coming into the Streamer as pass-through buffered data (chunked by size, not by specific logic) that is fed into a Parser. The Streamer handles all the logic of composing these chunks into logical components: breaking off chunks based on the logical components they hold, aggregating chunks if needed, and/or appending leftovers from previous chunks.
The diagrams and steps below describe the components for a sample DT Streamer service:
MainStreamer is the main component.
A Library is a Data Processor transformation object that contains predefined components used to transform a range of industry messaging standards. A Data Processor transformation uses a Library to transform an industry message type input into other formats. You can create Library objects for all libraries.
A Library contains a large number of objects and components, such as Parsers, Serializers, and XML schemas, which transform the industry standard input and specific application messages into XML output. A Library might contain objects for message validation, acknowledgments, and diagnostic displays. A Library uses objects to transform the messaging type from industry standard input to XML and from XML to other formats.
You can create Library objects for ACORD, BAI, CREST, DTCC-NSCC, EDIFACT, EDI-UCS & WINS, EDI-VICS, EDI-X12, FIX, FpML, HIPAA, HIX, HL7, IATA, IDS, MDM Mapping, NACHA, NCPDP, SEPA, SWIFT, and Telekurs libraries.
Each Library object transforms a particular industry standard. For example, the HL7 Library contains components for each of the message types or data structures available in the HL7 messaging standard for medical information systems.