Last Updated: May 26, 2021

Challenge

Planning the optimal configuration of database servers to support operational workloads requires sizing. Database sizing involves estimating the types and sizes of the components of a data architecture and producing a sizing document. An effective sizing activity involves a number of different roles, including data architects, database administrators, and business analysts. Database administrators may also require size estimates for database capacity planning.

Description

The first step in database sizing is to review the system requirements to define:

  • Expected Macro Data Architecture - Will there be staging areas, operational data stores, a centralized data warehouse, master data databases, and/or data marts?
  • Expected Data Architecture Elements - Each additional database element (table, index, and column) requires more space. This is especially true when data is replicated across multiple systems, such as a data warehouse whose infrastructure also includes an operational data store: the same data held in the ODS is also present in the warehouse, albeit in a different format.
  • Expected Source Data Volume - It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target can be calculated by following the data flows from source to target. For example, say a sales order table is built by denormalizing a source table that holds 12 months of sales data in a single row (one column per month). Each source row translates to 12 target rows, so a source table with one million rows ends up as a 12-million-row table.
  • Data Granularity - Granularity refers to the lowest level of detail stored in a fact table. Granularity affects database size to a great extent, especially for aggregate tables: the level at which a table is aggregated raises or lowers its row count. For example, a sales order fact table's size is likely to differ greatly depending on whether it is aggregated at a monthly or a quarterly level. The granularity of a fact table is determined by the dimensions linked to it, so the number of connected dimensions drives the table's row count and, in turn, the disk space it occupies.
  • Retention Period - In some cases the retention period that applies to the data must be considered, and it may vary: data in a staging area for a data warehouse may require a shorter retention period than data that will reside in the access layer. Sizing must also include estimates for any dual data load performed in real time to a backup server to support high availability, disaster recovery, or a production-like environment.
  • Load Frequency and Method - Load frequency affects the space requirements for the staging areas: a load plan that updates a target less frequently typically loads more data in one pass, so the staging areas need more space. A full refresh requires more space for the same reason.
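The source-to-target row-count analysis described above can be sketched as a simple calculation. This is an illustrative sketch, not part of the original practice; the table and fan-out values come from the denormalization example in the text.

```python
# Estimate target row counts by following fan-out factors from source to
# target. In the example from the text, a source sales table stores 12 months
# per row (one column per month), so each source row denormalizes into 12
# target rows.

def target_rows(source_rows: int, fan_out: int) -> int:
    """Rows expected in the target after applying the source-to-target fan-out."""
    return source_rows * fan_out

source_sales_rows = 1_000_000   # rows in the source sales table
months_per_row = 12             # one column per month -> 12 target rows per source row

print(target_rows(source_sales_rows, months_per_row))  # -> 12000000
```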

Determining Growth Projections

One way to estimate data growth over time is scenario analysis. For example, in a scenario analysis of a sales-tracking data mart, the number of sales transactions to be stored can serve as the basis for the sizing estimate. In the first year, 10 million sales transactions are expected; this equates to 10 million fact-table records.

Next, use the sales growth forecasts for the upcoming years to calculate database growth. An annual sales growth rate of 10 percent translates into 11 million new fact-table records in the second year, and at the end of five years the fact table is likely to contain about 60 million records. Calculate further estimates based on 5 percent annual sales growth (case 1) and 20 percent annual sales growth (case 2). Multiple projections for best-case and worst-case scenarios can be very helpful.
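The scenario analysis above reduces to a compound-growth sum. A minimal sketch, using the first-year volume and growth rates from the text (the rounding choice is an assumption):

```python
# Scenario analysis of fact-table growth: year-one volume plus compounded
# annual growth, accumulated over the retention horizon.

def fact_table_records(first_year: int, annual_growth: float, years: int) -> int:
    """Cumulative fact-table records after `years` of compounded growth."""
    total = 0
    yearly = first_year
    for _ in range(years):
        total += yearly
        yearly = round(yearly * (1 + annual_growth))
    return total

base = 10_000_000  # 10 million sales transactions in year one
for label, rate in [("case 1 (5%)", 0.05), ("expected (10%)", 0.10), ("case 2 (20%)", 0.20)]:
    print(label, fact_table_records(base, rate, years=5))
```

At 10 percent growth this yields 61,051,000 records over five years, i.e., the "about 60 million" figure used in the text.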

Consider that growth may not be a linear percentage. Examining Fibonacci or exponential growth series over a period of time may help plan and cost some aspects of data growth.

Oracle Table Space Prediction Model

Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature is useful for predicting table space requirements.

Oracle incorporates a table space prediction model in the database engine that provides projected statistics for space used by a table. The following Oracle 10g query returns projected space usage statistics for tables with data:

[Image: DB Sizing query]

The QUALITY column indicates the quality of the output as follows:

  • GOOD - The value for the time point is based on data in the Automatic Workload Repository (AWR), a built-in repository present in every Oracle database into which Oracle snapshots vital statistics at regular intervals; the snapshot timestamp falls within 10 percent of the interval from the requested time point.
  • INTERPOLATED - The data for this time point did not meet the GOOD criteria but was based on data gathered before and after the time point.
  • PROJECTED - The time point is in the future, so the data is estimated based on previous growth statistics.

Baseline Volumetric

Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most databases contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Different database products use different storage methods for each data type, so be sure to consult the database manuals to determine the size of each data type. Add up the field sizes to determine the row size, then use the data volume projections to determine the number of rows and multiply the row count by the row size to estimate each table's size.

The default estimate for index size is to assume it is the same as the table size. Also estimate the temporary space needed for sort operations; for data warehouse applications, where summarizations are common, plan on large temporary spaces. Temporary space can be as much as 1.5 times the size of the largest table in the database.
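The worksheet approach can be sketched as follows. The field sizes (in bytes) and table inventory here are illustrative assumptions; as noted above, actual per-type storage sizes must come from the database manuals.

```python
# Baseline volumetric worksheet: sum field sizes to get row size, multiply by
# projected rows for table size, assume index space equals table space, and
# reserve temp space of 1.5x the largest table for sorts/summarization.

def table_bytes(field_sizes: list[int], rows: int) -> int:
    """Estimated table size: row size (sum of field sizes) times row count."""
    return sum(field_sizes) * rows

tables = {
    # name: (field sizes in bytes, projected rows) -- illustrative values
    "sales_fact": ([8, 8, 8, 4, 4, 12], 60_000_000),
    "customer_dim": ([8, 40, 40, 20], 2_000_000),
}

sizes = {name: table_bytes(fields, rows) for name, (fields, rows) in tables.items()}
index_estimate = sum(sizes.values())          # default: indexes ~= table sizes
temp_space = int(1.5 * max(sizes.values()))   # sort/summarization work area

total = sum(sizes.values()) + index_estimate + temp_space
print({k: v // 2**20 for k, v in sizes.items()}, "total MB:", total // 2**20)
```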

Another useful approach is to load the data architecture with representative data and measure the resulting database sizes. The test load can be a fraction of the actual data and is used only to gather basic sizing statistics; growth projections are then applied to these statistics. For example, suppose the fact table measures 10MB after ten thousand sample records are loaded. Based on the scenario analysis, this fact table is expected to contain 60 million records after five years, so its estimated size is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Do not forget to add indexes and summary tables to the calculations.
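The sample-load extrapolation is a single ratio. A sketch using the figures from the example above:

```python
# Extrapolate full-scale size from a small representative load: scale the
# measured sample size by the ratio of projected rows to sample rows.

def extrapolated_size_mb(sample_mb: float, sample_rows: int, projected_rows: int) -> float:
    """Projected table size, assuming size grows linearly with row count."""
    return sample_mb * (projected_rows / sample_rows)

# From the text: a 10,000-row sample load measures 10 MB; the five-year
# projection is 60 million rows.
print(extrapolated_size_mb(10, 10_000, 60_000_000))  # -> 60000.0 (MB, ~60 GB)
```

Remember to run the same extrapolation for indexes and summary tables, which the sample load also exercises.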

Estimating

When there is not enough information to calculate an estimate as described above, use educated guesses and rules of thumb to develop as reasonable an estimate as possible.

  • If the source data model is not available, use what is known about the source data to estimate the average field size and the average number of fields in a row, and from these determine table size. Based on an understanding of transaction volume over time, determine growth metrics for each type of data and calculate the source data volume (SDV) from table size and growth metrics.
  • If the target data architecture is not sufficiently complete to determine table sizes, base estimates on multiples of the SDV:
    • If the architecture includes staging areas, add another SDV for each source subject area that will be staged, multiplied by the number of loads retained in staging.
    • If data will be consolidated into an operational data store, add the SDV multiplied by the number of loads retained in the ODS for historical purposes (e.g., keeping one year's worth of monthly loads = 12 x SDV).
    • Warehouse volume depends on the frequency and granularity of the data in the warehouse; this may be another SDV + (0.3n x SDV, where n = the number of time periods loaded into the warehouse over time).
    • If the data architecture includes aggregates, add a percentage of the warehouse volume based on how much of the warehouse data will be aggregated and to what level (for example, if the rollup level represents 10 percent of the dimensions at the detail level, use 10 percent).
    • Similarly, for data marts, add a percentage of the data warehouse volume based on how much of the warehouse data is moved to the mart.
    • Be sure to factor growth projections over time and the history to be retained into all of these calculations.
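The rules of thumb above can be combined into a rough calculator. This is a sketch; all factors and the example inputs are illustrative assumptions, not prescribed values.

```python
# Rule-of-thumb estimate built on multiples of the source data volume (SDV),
# following the checklist above: staging copies, ODS history, warehouse volume
# (SDV + 0.3n * SDV), aggregates, and data marts.

def rough_estimate(sdv: float,
                   staged_loads: int = 0,
                   ods_loads: int = 0,
                   warehouse_periods: int = 0,
                   aggregate_fraction: float = 0.0,
                   mart_fraction: float = 0.0) -> float:
    """Total estimated volume, in the same units as `sdv`."""
    staging = sdv * staged_loads
    ods = sdv * ods_loads
    warehouse = (sdv + 0.3 * warehouse_periods * sdv) if warehouse_periods else 0.0
    aggregates = warehouse * aggregate_fraction
    marts = warehouse * mart_fraction
    return staging + ods + warehouse + aggregates + marts

# Example: 2 staged loads, 12 monthly ODS loads, 24 warehouse periods,
# 10% aggregates, 25% moved to a data mart, for an SDV of 100 GB.
print(rough_estimate(100, staged_loads=2, ods_loads=12, warehouse_periods=24,
                     aggregate_fraction=0.10, mart_fraction=0.25))
```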

It is very often the case that there is much more data than expected, so Informatica recommends adding a reasonable margin of safety to the calculations. From a risk perspective it is better to over-purchase some storage, especially if it is of the inexpensive kind, than to have the project held up later while additional storage is acquired and the related hardware costs are incurred.

A ‘Database Sizing Model’ calculator is included in the ‘Database Capacity Planning’ accelerator.
