Last Updated: May 25, 2021

Challenge

A Data Archive project actually comprises several mini projects, two of which require some level of optimization in order to be successful. The first is the initial archive of data up to the retention policy for all of the entities defined as "in scope" for the project. The second is the continuation of archive cycle runs on a periodic basis after the initial archive is complete for all in-scope entities. The challenge is to identify the fastest and most efficient way to relocate all of the initial data, and then to determine the frequency for running each entity going forward in order to control growth.

Description

Determine Optimal Volume

For each entity defined as in scope for the project, there is an optimal amount of data that can be relocated in a single cycle or run of a project. The optimal volume for one entity is unrelated to the optimal volume for another, so it must be determined separately for each entity. Some low-volume entities will not reach their optimal volume even if all of the initial data were archived in a single cycle; in such cases, cycles can be run based purely on functional considerations. For most entities in the initial project this will not be the case, unless they were included only as prerequisites to running larger-volume entities. Most of the entities in the initial project will have some performance considerations, and the following process should be followed for each one.

Create an archive project and run the first job for a very small amount of data; the minimum is usually one month, although for some financial entities the minimum is one fiscal year. Most entities allow the candidates to be restricted by a specific organization, ledger, or set of books in addition to a date parameter. In some cases the volume will be so large that this process should be run by organization, set of books, or ledger in addition to a date parameter, but if possible use all organizations, sets of books, or ledgers. This process determines the number of rows relocated per unit of time, and if it is later decided to archive by organization, set of books, or ledger, the same ratio will apply.

Run the archive cycle/project for the small amount of data from beginning to end and note the time to complete the generation of candidates, copy to staging, copy to destination, and delete from source. Also note the number of transactions that show up as eligible in the summary report, and add up the total number of rows relocated for all of the tables in the cycle. At this point some queries can be run, or an educated guess can be made using the DGA, to determine a date parameter for the next run. For most entities, if the first job processed one month's worth of data, the next should process three months' worth. Run the next job and note the times, the number of eligible candidates, and the total rows. Continue running jobs, increasing the number of eligible candidates in each subsequent run, up to one year's worth of data in a single cycle if possible. It may not be possible to archive one year's worth of data in a single cycle; the purpose of this process is to determine how much data is too much to relocate in a single cycle.
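As an illustration of this bookkeeping, the sketch below (a hypothetical helper with made-up figures, not part of Data Archive) records the phase timings and row counts noted for each test cycle and computes rows relocated per hour, the ratio used to compare cycle sizes.

```python
from dataclasses import dataclass

@dataclass
class TestCycle:
    """Figures noted from one test archive cycle (all values are examples)."""
    months_of_data: int         # size of the date window archived
    candidate_hours: float      # generate candidates
    staging_hours: float        # copy to staging
    destination_hours: float    # copy to destination
    delete_hours: float         # delete from source
    eligible_transactions: int  # from the summary report
    total_rows: int             # summed across all tables in the cycle

    @property
    def total_hours(self) -> float:
        return (self.candidate_hours + self.staging_hours
                + self.destination_hours + self.delete_hours)

    @property
    def rows_per_hour(self) -> float:
        return self.total_rows / self.total_hours

# Hypothetical results for 1-, 3-, and 12-month test cycles.
cycles = [
    TestCycle(1, 0.5, 1.0, 1.0, 0.5, 12_000, 1_500_000),
    TestCycle(3, 0.6, 2.2, 2.3, 1.1, 36_500, 4_600_000),
    TestCycle(12, 1.0, 14.0, 15.5, 9.5, 150_000, 18_800_000),
]

for c in cycles:
    print(f"{c.months_of_data:>2} months: {c.total_rows:,} rows "
          f"in {c.total_hours:.1f} h -> {c.rows_per_hour:,.0f} rows/hour")
```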

Up to a point, the very small cycles will take almost as long to run as cycles archiving much more data. Calculate the number of rows processed per unit of time for each cycle; eventually, if the volume is sufficient, relocating twice as many rows in one cycle will take much longer than relocating the same data in two cycles. For entities where one year's worth of data performed well, the process can be continued to determine how much more data can be archived without hitting performance issues, although typically, for restore considerations, the initial archive for that entity would be run for just one year's worth of data at a time. The single-cycle volume with the highest number of rows processed per unit of time is the optimal volume for that entity.
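Using throughput figures like the hypothetical ones above, identifying the optimal volume is then just a matter of picking the cycle size with the highest rows-per-hour ratio:

```python
# Throughput (rows per hour) observed for each hypothetical test-cycle size,
# roughly matching the example figures noted above.
rows_per_hour_by_months = {1: 500_000, 3: 740_000, 12: 470_000}

# The single-cycle volume with the highest rows per hour is the optimal volume.
optimal_months = max(rows_per_hour_by_months, key=rows_per_hour_by_months.get)
print(f"Optimal single-cycle volume: {optimal_months} months of data")
```

In this example the 3-month cycle outperforms both the very small 1-month cycle and the oversized 12-month cycle, which is exactly the pattern the test runs are meant to expose.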

Determine Frequency for Future Cycles

Determining the frequency of future cycles is easier for some entities than for others (e.g., for a financial entity that must be run by fiscal year, the frequency is simple: run once a year). Other date-based entities that allow more granularity in the time scale offer more flexibility in how often they need to be run. For any entity, the minimum frequency would be once a year, and some entities may need to be run as often as weekly in some environments. The frequency can be determined from the optimal volume, and it may need to be adjusted over time; one month of newer data may contain more volume than six months of older data. Queries can be run, or the growth chart from the DGA can be used, to determine data growth over time. Because it is always data older than the retention policy that is being relocated, any growth spurts can be anticipated well in advance. Set the frequency of runs so that a cycle can be skipped when the archive jobs must be scheduled around the needs of the business users who own the data; there may be certain times when the jobs cannot be run (e.g., month-end close).
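As a rough illustration of deriving a frequency from the optimal volume, the sketch below (hypothetical figures, not an Informatica utility) divides the optimal single-cycle row count by the observed monthly growth in rows crossing the retention boundary to suggest how often cycles should run.

```python
# Hypothetical example: derive a cycle frequency from the optimal volume.
optimal_volume_rows = 4_600_000   # rows the entity handles well in one cycle
monthly_growth_rows = 1_600_000   # rows aging past the retention policy each month

# Months of growth that one optimally sized cycle can absorb.
months_per_cycle = optimal_volume_rows / monthly_growth_rows
print(f"Run a cycle roughly every {months_per_cycle:.1f} months")  # ~2.9 months

# Schedule slightly more often than the calculated interval so that a cycle
# can be skipped around blackout periods such as month-end close.
```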

While the frequency does not have to match for every entity, in some instances keeping them in step may be the desired way to continue archiving data. There could be entities whose data needs to be archived in one-month increments, but instead of running monthly, the data is archived annually in twelve separate jobs scheduled to run one after another, or over multiple weekends or evenings. If several financial entities are run annually while several transactional entities are run monthly, and the old inactive data is frequently accessed and drill-down into the financial applications is required, the use of combined seamless access may be forced. If everything is kept in sync, the need for combined seamless access can be reduced. If seamless access does not require inter-application drill-down, or it is used infrequently, it can be excluded from the decision on how frequently to run each entity.

The frequency for running archive jobs will always vary, since different factors must be considered in each specific case. Knowing how much data can be archived in a single cycle for each entity is the foundation for working with the owners of the different pieces of data to determine that frequency. Data access requirements should be considered along with the performance implications. If all of the data is archived up to the retention policy, the associated tables are then reorganized to realize the performance benefits, and cycles are run weekly or monthly going forward, subsequent reorganizations will most likely not be required. If the data should be archived weekly or monthly but is instead archived annually in 12 or 52 separate jobs, the tables will probably need to be reorganized after the archive jobs are run each year.
