Following a rigorous methodology is key to delivering customer satisfaction and expanding analytics use cases across the business.
A Data Archive project actually consists of several mini projects, and two of these mini projects require some level of optimization in order to be successful. The first is the initial archive of data up to the retention policy for all of the entities defined as “in scope” for the project. The second is the continuation of archive cycle runs on a periodic basis after the initial archive is complete for all of the in-scope entities. The challenge is to identify the fastest and most efficient way to relocate all of the initial data, and then to determine the frequency for running each entity going forward in order to control growth.
For each entity defined as in scope for the project, there is an optimal amount of data that can be relocated in a single cycle, or run, of a project. The optimal volume for one entity is not related to the optimal volume for another, so it must be determined separately for each entity. Some low-volume entities will not reach their optimal volume even if all of their initial data is archived in a single cycle; for those, cycles can be scheduled purely on functional considerations. This will not be the case for most entities in the initial project, unless they were included only as prerequisites for running larger-volume entities. Most of the entities in the initial project will have some performance considerations, and the following process should be followed for each one.
Create an archive project and run the first job for a very small amount of data; the minimum is usually one month, but for some financial entities it is one fiscal year. Most entities also allow the candidates to be restricted by a specific organization, ledger, or set of books in addition to the date parameter. In some cases the volume will be so large that the process should be run by organization, set of books, or ledger as well, but if possible include all organizations, sets of books, or ledgers. This process is used to determine the number of rows relocated per unit of time, and if it is later decided to archive by organization, set of books, or ledger, the same ratio will apply.
Run the archive cycle/project for the small amount of data from beginning to end and note the time to complete the generation of candidates, copy to staging, copy to destination, and delete from source. Also note the number of transactions that show up as eligible in the summary report, and add up the total number of rows relocated for all of the tables in the cycle. At this point, queries can be run or an educated guess can be made using the DGA to choose a date parameter for the next run. For most entities, if the first job processed one month’s worth of data, the next should process three months’ worth. Run the next job and note the times, the number of eligible candidates, and the total rows. Continue running jobs, increasing the amount of data in each subsequent run up to one year’s worth in a single cycle, if possible. It may not be possible to archive one year’s worth of data in a single cycle, but the purpose of this process is to determine how much data is too much to relocate in one cycle.
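As a minimal sketch of the bookkeeping involved, the snippet below computes rows relocated per hour for each test cycle. The data ranges, durations, and row counts are hypothetical placeholders for figures collected from the job log and the summary report of each run.

```python
# Minimal sketch: derive rows relocated per hour for each test cycle.
# All figures below are hypothetical placeholders for values taken from the
# job log and summary report of each run.

test_cycles = [
    # (data range archived, elapsed hours across all steps, total rows relocated)
    ("1 month",    4.0,  1_200_000),
    ("3 months",   6.5,  3_600_000),
    ("6 months",  11.0,  7_100_000),
    ("12 months", 30.0, 14_500_000),
]

for label, hours, rows in test_cycles:
    print(f"{label:>9}: {rows:>12,} rows / {hours:4.1f} h = {rows / hours:,.0f} rows per hour")
```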
Up to a point, very small cycles will take almost as long to run as cycles archiving much more data. Calculate the number of rows processed per unit of time for each cycle; eventually, if the volume is sufficient, relocating twice as many rows in one cycle will take much longer than relocating the same data in two cycles. For entities where one year’s worth of data performed well, the process can be continued to determine how much more data can be archived without hitting performance issues, although typically, for restore considerations, the initial archive for such an entity would be run for just one year’s worth of data at a time. The single-cycle volume with the highest number of rows processed per unit of time is the optimal volume for that entity.
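Continuing the sketch, the optimal single-cycle volume can be read off as the test cycle with the highest throughput. The rows-per-hour figures here are hypothetical and would come from the calculation shown above.

```python
# Minimal sketch: treat the test cycle with the highest rows-per-hour ratio as
# the optimal single-cycle volume for the entity. Figures are hypothetical.

throughput = {
    "1 month":   300_000,   # rows per hour
    "3 months":  554_000,
    "6 months":  645_000,
    "12 months": 483_000,   # throughput drops off: too much data for one cycle
}

optimal = max(throughput, key=throughput.get)
print(f"Optimal single-cycle volume: {optimal} ({throughput[optimal]:,} rows per hour)")
```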
Determining the frequency of future cycles is easier for some entities than for others (e.g., for a financial entity that has to be run by fiscal year, the frequency is simply once a year). Other date-based entities that allow more granularity in the time scale offer more flexibility in how often they need to be run. For any entity, the minimum frequency would be once a year, and some entities may need to be run as often as weekly in some environments. The frequency can be determined based on the optimal volume, and it may need to be adjusted over time; in terms of volume, one month of newer data may contain more rows than six months of older data. Queries can be run, or the growth chart from the DGA can be used, to determine data growth over time. Because it is always data older than the retention policy that is being relocated, any growth spurts can be anticipated well in advance. Set the frequency of runs to allow for skipping a cycle when the archive jobs need to be scheduled around the needs of the business users who own the data; there may be certain times when the jobs cannot be run (e.g., month-end close).
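The arithmetic behind this can be sketched as follows. It assumes the optimal single-cycle volume found earlier and a monthly growth rate for data aging past the retention policy, taken from queries or the DGA growth chart; both numbers below are hypothetical.

```python
# Minimal sketch: estimate how often an entity needs a cycle so that each run
# stays at or below its optimal volume. Both inputs are hypothetical figures.

optimal_rows_per_cycle = 7_100_000              # rows the entity handles well in one cycle
rows_aging_past_retention_per_month = 1_500_000  # growth rate from queries or the DGA chart

months_between_cycles = optimal_rows_per_cycle / rows_aging_past_retention_per_month
# Cap the interval at 12 months, since the minimum frequency is once a year.
print(f"Run a cycle roughly every {min(months_between_cycles, 12):.1f} months")
```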
The frequency for every entity does not have to match, but in some instances that may be the desired way to continue archiving data. There could be entities where the data needs to be archived in one-month increments, but instead of running monthly, the data is archived annually in twelve separate jobs scheduled to run one after the other, or over multiple weekends or evenings. If several financial entities are run annually, several transactional entities are run monthly, the old inactive data is frequently accessed, and drill-down into the financial applications is required, the use of combined seamless access may be forced. Keeping everything in sync reduces the need for combined seamless access. If seamless access does not require inter-application drill-down, or is used infrequently, it can be excluded from the decision on how often to run each entity.
The frequency for running archive jobs will always vary, since there are different factors to consider in each specific case. Knowing how much data can be archived in a single cycle for each entity is the foundation for working with the owners of the different pieces of data to determine that frequency. The data access requirements should be considered along with the performance implications. If all of the data is archived up to the retention policy, the associated tables are then reorganized to realize the performance benefits, and cycles are run weekly or monthly going forward, subsequent reorganizations will most likely not be required. If the data must be archived weekly or monthly but is instead archived annually in 12 or 52 separate jobs, the tables will probably need to be reorganized after the archive jobs are run each year.