Data Transformation is an application that processes complex files, such as messaging formats, HTML pages, and PDF documents. Data Transformation also transforms formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT. It is a comprehensive tool for transforming data from a variety of hierarchical, semi-structured and unstructured formats to virtually any other format.
Required components that need to be built as part of a Data Transformation service include Streamers, Parsers, Mappers, and Serializers. To create any-to-any transformations, components may (and should) be chained. A Parser normally covers any format to XML, a Mapper covers one representation of XML to another, and a Serializer covers XML to any end format. Streamers are normally used on the front end to break down large files, although they can be used on the back end as well. Streamers are available for textual or XML structures. The number of components needed to properly build the DT service depends upon the source data and its complexity.
Data Transformation installs by default when you install Informatica Developer (the Developer tool).
The Data Processor transformation has multiple views that you access when you configure the transformation and run it in the Developer tool. Some of the Data Processor transformation views do not appear in the Developer tool by default. To change the views for the transformation, click Window > Show View > Other > Informatica.
The Data Processor transformation has the following views:
When building Data Transformation services, the following questions should be answered to determine not just what components are required, but how many:
A wizard can be used to create an auto-generated Data Processor transformation with input and output formats such as COBOL, XML, ASN.1, relational, or JSON. The wizard can also be used to transform user-defined formats.
The wizard creates a transformation with relevant Script, XMap, or Library objects that serve as templates to transform the input format to the output format. The wizard builds the transformation solution according to the formats selected and the specification file, example file, or copybook provided. The transformation might not be complete, but it contains components that you connect and customize to complete the transformation definition.
Processing semi-structured or flat files presents a challenge: the variability of the data must be handled while the data is also put into a format that can be easily processed (such as an XML structure). B2B Data Transformation (DT) accomplishes this by implementing a Parser object.
A Parser defines a process of transforming any source format into XML.
When a file structure is received without a pre-constructed parsed target structure (as exists for EDI, SWIFT, HIPAA, etc.), the input data must be evaluated: either decipher what each field is (by using the header row in the file) or work with the customer to create a record layout.
When creating the record layout many variables can affect the overall design. The developer should spend adequate time with someone who knows the data to obtain answers to the list of questions below. Taking the proper time to analyze the source data before creation of the XSD structure will significantly decrease rework in the development phase of a project.
When creating the XSD for a given DT service, always try to create global structures that can be reused in other areas. Only create global structures if elements or groups of elements are widely used in the XSD; if an element or group is used only once, make it local. This not only helps to reduce file size and keep down clutter in the XSD, but also helps with maintenance for future growth.
Below is an example of two global structures in an XSD and how they are implemented. Use the Annotations to describe what each element or group is and how they relate. For example, if an XSD has many global structures, identify the Root element with a comment in the Annotation.
Global Structures
The GS1 element below ‘references’ the global GS1 structure. The referencing element does not duplicate the elements defined under the global structure’s sequence; it simply points to them.
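A minimal sketch of what such an XSD might look like follows; the GS1 child elements, the Interchange root element, and the annotation text are illustrative assumptions rather than the original example:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Global (reusable) structure, annotated so its purpose is clear -->
  <xs:element name="GS1">
    <xs:annotation>
      <xs:documentation>Global GS1 group header; reused wherever a functional group appears.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <!-- Child element names are illustrative assumptions -->
        <xs:element name="FunctionalIDCode" type="xs:string"/>
        <xs:element name="SenderCode" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Second global structure: the Root element, identified in its Annotation -->
  <xs:element name="Interchange">
    <xs:annotation>
      <xs:documentation>Root element of the document.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <!-- References the global GS1 structure instead of redefining it -->
        <xs:element ref="GS1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>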
A data type can be declared for each element in an XSD. Elements marked with a primitive data type can be used by DT to help process the data before it is inserted into the element. By default, DT uses the XSD type to validate the data being inserted, and it will only insert data that matches the type declared for the element.
Example:
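A sketch of such a declaration (the xs: prefix and the surrounding schema element are assumed and omitted here):

<!-- 'Field1' accepts only integer data -->
<xs:element name="Field1" type="xs:integer"/>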
The XSD snippet above has one element, named ‘Field1’, which can only accept data of the integer type. When processing the source data, DT will produce results for the following:
Input Data Contains:
The above scenarios will run without error. The property “disable_XSD_type_search” must be checked for DT to throw an exception when processing the source if the data type of the source does not match what is in the XSD.
When creating a Parser, the organization of objects is just as important as the processing that they do. From a maintenance perspective, the main parser script should be used for logic implementation and all main marshalling to other objects. Just as was done with the creation of the XSD in the step above, grouping like segments of code together not only helps with maintenance, but also with readability.
Other scripts can be created, for example, to hold all of the variables used.
When processing data in a DT Parser other Parsers may need to be called to help with the complexity of the data being processed. DT provides two main objects to assist with this:
The RunParser is useful if a file path is being passed to load a file or the data in question is not in the current sequence of the data being processed.
The EmbeddedParser is useful when the data being processed is sequential. Unlike the RunParser, the EmbeddedParser does not break from the current position of the data being read; instead it allows a finer grain of control over how that data is processed. Use the EmbeddedParser if the data has any or all of the following attributes:
As shown in the example below, the offset count begins back at 0 inside the Parser that the EmbeddedParser calls. The result of the code below is:
Field1 = AAAAA
Field2 = 123
The rest of the string (BBBB) was dropped.
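Assuming, for illustration, that the source string was AAAAA123BBBB and that the called Parser maps the first five characters to Field1 and the next three to Field2, the output would look something like the following (the Output wrapper element name is an assumption):

<Output>
  <Field1>AAAAA</Field1>
  <Field2>123</Field2>
  <!-- the trailing 'BBBB' is not mapped to any element and is dropped -->
</Output>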
Most data being retrieved from an XML structure for a business unit tends to be much smaller than the entire payload. Business units tend to cherry-pick data out of sub-structures, which requires traversing deep into the structure for relatively small amounts of data. Data Transformation can process complex XML structures efficiently, allowing this type of cherry-picking to be accomplished easily.
The Data Transformation Mapper handles the processing of XML-to-XML structures. The Mapper loads the source XML into a Document Object Model (DOM), allowing it to quickly move forward and backward through the DOM.
Loading data into a DOM can be very costly for memory and CPU. If the source XML is 100MB, but the data required is small, it is worth modeling a target XSD that can hold the required data. Processing applications will see a performance boost.
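As a sketch, if a business unit only needs an account number and a status out of a large source document, the purpose-built target XSD could be as small as the following (the element names are illustrative assumptions):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="AccountSummary">
    <xs:complexType>
      <xs:sequence>
        <!-- Only the fields the business unit actually needs -->
        <xs:element name="AccountNumber" type="xs:string"/>
        <xs:element name="Status" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>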
Data Transformation has two functions for the processing of XML:
The RunMapper function takes an ‘input’ from either another Mapper or any other Data Transformation Component.
Use the RunMapper when processing a whole XML structure. The RunMapper calls a Mapper that has already been created, or a new Mapper can be created under the RunMapper. Only use the RunMapper if the XML structure has not already been loaded into a DOM by another RunMapper.
NOTE: If a RunMapper is used inside a Parser, the target XSD is automatically passed directly into the Mapper called from the RunMapper, which provides a performance boost. The ‘input’ attribute does not need to be used.
An EmbeddedMapper activates a secondary Mapper, allowing it to process the XML in the current scope without reloading the XML into the DOM. Output from the EmbeddedMapper is stored in the same output structure that the current Mapper has in scope.
The EmbeddedMapper also allows other variable data to be passed into the Mapper it calls by using schema_connections. If these fields are altered inside the secondary Mapper, the parent Mapper component will also reflect the updated values.
Source files (binary, text, XML) can sometimes be very large (e.g., more than 100 MB in size). In such cases, the files can be too large to hold in memory in a single read, or they can cause the entire DT service to wait extensively for the read to complete before processing. This can cause a serious degradation in performance, so an alternate solution is required to handle these large files. The Data Processor Streamer object can remove these bottlenecks.
DT Streamers can help with performance when there is a need to consume an exceptionally large file. The time and resources needed to load a large file into memory in one pass can be very costly. No matter what 'type' of file the source is, it must be loaded into memory before DT can begin processing. If some level of pre-processing is required, there is the initial time to load, plus the time to reload the file into memory so that the pre-processing can be accomplished.
Note: For the terminology used in this Best Practice, a “chunk” refers to a piece of data taken from the entire data source. In DT, all pieces of data are typically stored in memory. Chunks in memory are referred to as “buffers” but these two terms are synonymous. A “physical buffer” refers to the raw chunk (or chunks) of data drawn from the source. A “logical buffer” is the part of the physical buffer that holds a complete section relevant for processing.
If the data is “chunked”, then many very small chunks are handled in memory, and the strain on the overall system can be greatly reduced, resulting in significant performance gains.
The idea is to treat the source data coming into the Streamer as pass-through buffered data (chunked by size, not by specific logic) that is fed into a Parser. The Streamer handles all the logic of composing these chunks into logical components: breaking off chunks based on the logical components they hold, aggregating chunks if needed, and/or appending leftovers from previous chunks.
The diagrams and steps below describe the components for a sample DT Streamer service:
MainStreamer is the main component.
A Library is a Data Processor transformation object that contains predefined components used to transform a range of industry messaging standards. A Data Processor transformation uses a Library to transform an industry message type input into other formats. You can create Library objects for all libraries.
A Library contains a large number of objects and components, such as Parsers, Serializers, and XML schemas, which transform the industry standard input and specific application messages into XML output. A Library might contain objects for message validation, acknowledgments, and diagnostic displays. A Library uses objects to transform the messaging type from industry standard input to XML and from XML to other formats.
You can create Library objects for ACORD, BAI, CREST, DTCC-NSCC, EDIFACT, EDI-UCS & WINS, EDI-VICS, EDI-X12, FIX, FpML, HIPAA, HIX, HL7, IATA, IDS, MDM Mapping, NACHA, NCPDP, SEPA, SWIFT, and Telekurs libraries.
Each Library object transforms a particular industry standard. For example, the HL7 Library contains components for each of the message types or data structures available in the HL7 messaging standard for medical information systems.