You can optimize the performance of File Ingestion and Replication tasks by configuring an appropriate file size and batch size.
Source File Size
For large source files, consider splitting them into multiple smaller files, as processing multiple small files is faster than processing a single large file. The recommended size for each split file is around 500 MB to 1 GB.
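As an illustration only, the following Python sketch splits a large delimited source file into pieces of roughly 500 MB each before the File Ingestion and Replication task picks them up. The file names, the naming pattern, and the 500 MB threshold are assumptions for this example, not values taken from the product documentation.

```python
import os

CHUNK_BYTES = 500 * 1024 * 1024           # target size per split file (~500 MB), an assumed value
SOURCE_FILE = "orders.csv"                # hypothetical large source file
OUTPUT_PATTERN = "orders_part_{:04d}.csv" # hypothetical naming pattern for the split files

def split_file(source_path: str) -> None:
    """Split a large text file into ~500 MB pieces, keeping whole lines together."""
    part = 0
    current_size = 0
    out = None
    with open(source_path, "r", encoding="utf-8") as src:
        for line in src:
            # Start a new split file when the current one reaches the size threshold.
            if out is None or current_size >= CHUNK_BYTES:
                if out is not None:
                    out.close()
                out = open(OUTPUT_PATTERN.format(part), "w", encoding="utf-8")
                part += 1
                current_size = 0
            out.write(line)
            current_size += len(line.encode("utf-8"))
    if out is not None:
        out.close()

if __name__ == "__main__":
    split_file(SOURCE_FILE)
```

The resulting pieces stay within the recommended 500 MB to 1 GB range and can then be used as the source files for the task.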
Batch Size
When a File Ingestion and Replication job writes to a target, the number of files written with each COPY command affects performance. Specify a batch size in the source properties of the task to tune this behavior. The default batch size is 5. When the source is Amazon S3 or Azure Blob Storage, the Snowflake Cloud Data Warehouse V2 Connector supports a maximum batch size of 1000, and the Databricks DB SQL Connector also supports a maximum batch size of 1000. For other sources, the batch size must be between 1 and 20.
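As an illustrative calculation only (the file count and batch size below are assumed example values, not values from the product documentation), the batch size roughly determines how many COPY commands the target runs for a load:

```python
import math

# Assumed example values: 200 split source files loaded with a batch size of 50.
number_of_files = 200
batch_size = 50

# Each COPY command loads up to one batch of files, so this load results in
# about ceil(200 / 50) = 4 COPY commands.
copy_commands = math.ceil(number_of_files / batch_size)
print(copy_commands)  # 4
```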
Consider the following guidelines to optimize performance:
- When you specify a batch size, you must also ensure that the source file is split into multiple smaller files to optimize performance.
- For optimized performance, the batch size should be equal to or close to the number of files to be loaded into the target, as illustrated in the sketch after this list.
- Increasing the number of parallel batches in addition to the batch size can further improve performance.
- For information about the appropriate batch size values that you can specify for different sources and targets in a File Ingestion task, see the Local folder source properties topic in the File Ingestion help.
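As a rough illustration of the guidelines above, the following sketch picks a batch size that matches the number of files to load while staying within the connector's maximum. The limits of 1000 and 20 come from the ranges described earlier; the function name and parameters are hypothetical and are not part of any product API.

```python
def suggest_batch_size(number_of_files: int, max_batch_size: int) -> int:
    """Pick a batch size close to the file count, capped at the connector's maximum."""
    return max(1, min(number_of_files, max_batch_size))

# Hypothetical examples:
# Snowflake Cloud Data Warehouse V2 or Databricks DB SQL loading from
# Amazon S3 or Azure Blob Storage (maximum batch size 1000).
print(suggest_batch_size(number_of_files=350, max_batch_size=1000))  # 350
# Other sources (maximum batch size 20).
print(suggest_batch_size(number_of_files=350, max_batch_size=20))    # 20
```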