Last Updated: Aug 19, 2022

Challenge

In a typical Data Warehousing environment, the full volume of historical data is loaded during the project's Go Live phase. When historical data resides in relational databases, it can be extracted incrementally by adjusting date parameters. For flat file sources, the historical load process can be automated by following the procedure outlined below.

Description

The historical load process for flat files can be automated with a simple shell script and an Informatica workflow containing a Command task. The Command task calls the shell script, passing it input arguments. The script iterates over the files present in $PMSourceFileDir (or any project-specific source file directory) and updates a file list, which a child workflow then reads to apply the actual business logic. After each file is processed, the script archives it to a specified directory, which simplifies the restart procedure if the job fails because of a server outage or a process-related issue.

This method of processing historical data files is considered a best practice because it involves no manual intervention, which eliminates the scope for manual errors. It also upholds the principle that jobs should not be modified in a Production environment under any circumstances after project deployment. In addition, automated processing of historical data files accelerates the data load, so the application goes live faster than it would with manual processing. This method is superior to using a plain file list for the following reasons:

  • The amount of data processed per batch can be regulated. For example, when 100-150 historical data files (each containing millions of records) must be loaded, processing them through a single file list is a bad idea if the mappings contain caching transformations such as Sorters or Aggregators, because the caches can cause performance issues during the load. The methodology described in this article provides better control over the load process.
  • Better Restartability – Because files are processed sequentially, one by one, the load can easily be restarted from the point of failure; files that were already processed have been moved to the archive folder.
  • Traceability – Data and process-related issues are easy to debug because the exact processing date and time of each data file is logged.

Process Flow

The parent workflow's Command task invokes the shell script; the script writes each data file name to the file list, starts the child workflow that applies the business logic, and archives the file on success before moving on to the next one.

Script with Usage Example

The script below can be used in a project with slight modifications. It fetches repository- and Integration Service-related details from the infa_env environment file (a sample is shown further below).

Usage: $PMRootDir/Script/infa_load.sh -s <source_file_dir> -l <file_list> -w <workflow> -p <infa_project_folder> -f <file_pattern>

Mandatory flags:

  • -s: source file directory
  • -l: file list name
  • -w: workflow name
  • -p: Informatica project folder
  • -f: file pattern (for example, '*.txt')
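
For illustration, the command placed in the parent workflow's Command task could look like the following line (the argument values are hypothetical and should be replaced with project-specific ones):

$PMRootDir/Script/infa_load.sh -s $PMSourceFileDir -l hist_file.lst -w wf_load_history -p DWH_PROJECT -f '*.txt'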

Historical Data Load Script Sample
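
A minimal sketch of such a script is shown below. It assumes the environment file exports INFA_DOMAIN, INFA_INT_SERVICE, INFA_USER, and INFA_PWD (illustrative names) and uses the standard pmcmd startworkflow command to run the child workflow; adapt the paths and variable names to your project.

#!/bin/sh
# infa_load.sh - sketch of the historical data load driver script.
# Assumes infa_env exports INFA_DOMAIN, INFA_INT_SERVICE, INFA_USER
# and INFA_PWD (illustrative variable names).

usage() {
    echo "Usage: $0 -s <source_dir> -l <file_list> -w <workflow> -p <infa_folder> -f <file_pattern>" >&2
    exit 1
}

while getopts "s:l:w:p:f:" opt; do
    case "$opt" in
        s) SRC_DIR="$OPTARG" ;;
        l) FILE_LIST="$OPTARG" ;;
        w) WORKFLOW="$OPTARG" ;;
        p) INFA_FOLDER="$OPTARG" ;;
        f) FILE_PATTERN="$OPTARG" ;;
        *) usage ;;
    esac
done

# All five flags are mandatory.
for v in "$SRC_DIR" "$FILE_LIST" "$WORKFLOW" "$INFA_FOLDER" "$FILE_PATTERN"; do
    [ -z "$v" ] && usage
done

# Pull repository and Integration Service details from the environment
# file, assumed to sit next to this script.
. "$(dirname "$0")/infa_env"

ARCHIVE_DIR="$SRC_DIR/archive"
mkdir -p "$ARCHIVE_DIR"

# Process one file per run of the child workflow so batch size (and the
# cache pressure of Sorter/Aggregator transformations) stays under control.
for data_file in "$SRC_DIR"/$FILE_PATTERN; do
    [ -e "$data_file" ] || continue     # pattern matched no files

    # Point the indirect file list at the current file only.
    echo "$data_file" > "$SRC_DIR/$FILE_LIST"
    echo "$(date '+%Y-%m-%d %H:%M:%S') processing $data_file"

    # Run the child workflow synchronously and check its outcome.
    pmcmd startworkflow -sv "$INFA_INT_SERVICE" -d "$INFA_DOMAIN" \
        -u "$INFA_USER" -p "$INFA_PWD" -f "$INFA_FOLDER" \
        -wait "$WORKFLOW" || {
        echo "$(date '+%Y-%m-%d %H:%M:%S') FAILED on $data_file; fix and rerun" >&2
        exit 1
    }

    # Archive the processed file so a restart resumes with the next file.
    mv "$data_file" "$ARCHIVE_DIR/"
done

echo "$(date '+%Y-%m-%d %H:%M:%S') all historical files processed"

Because pmcmd is called with -wait, the script archives a file only after its child workflow has completed successfully, which is what makes the restart behavior safe.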

Informatica Environment File Sample
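
A matching environment file is sketched below with placeholder values; the variable names must line up with those referenced by the script. Storing a pmpasswd-encrypted password and passing it through pmcmd's -pv option is preferable to keeping a clear-text value here.

# infa_env - sample environment file (placeholder values)
INFA_DOMAIN=Domain_Dev              # Informatica domain name
INFA_INT_SERVICE=IS_Dev             # Integration Service name
INFA_REPO_SERVICE=RS_Dev            # Repository Service name
INFA_USER=infa_batch                # Repository user
INFA_PWD=********                   # placeholder; prefer an encrypted value
export INFA_DOMAIN INFA_INT_SERVICE INFA_REPO_SERVICE INFA_USER INFA_PWD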
