Every database administrator deals with this ETL headache at some point in their career. As data volumes grow, ETL processes start to take longer to complete. Cue the complaints from business users:
- “Why do our data loads take so long to run?”
- “Why can’t we get our reports out earlier?”
- “I get to the office early and need to be able to see results by the time I get in!”
Performance issues are something many organizations start to see when scaling up their platforms to handle more data, and they can be attributed to many different things. Sub-optimal database configurations and indexing strategies are common culprits that don’t show themselves until larger data volumes are processed. Often, though, the root cause is ETL code starting to struggle under the load. Designing your ETL processes to be scalable from the beginning greatly lowers the chances that the ETL component of the equation is the issue.
Here are some simple tips you can follow during the design phase to ensure your ETL processes are running as fast as possible:
- For batch processing, pull only the minimum data you need from your source systems. While you can take the paranoid approach and pull everything to determine changes, all of that data then has to be compared against the target to determine what changed, which eats additional memory and processing cycles. Consider whether you can use timestamps or sequence numbers from your source systems to identify what changed.
- Don’t perform cached lookups in your ETL tool against large tables. Cached lookups are great for small dimension and code table lookups, but use some discretion before blindly going down the cached lookup route. That transaction table with 50 million records? Probably not a good idea to cache that table in memory. Try using a left outer join instead.
- Avoid row-by-row processing in your ETL tool. Use your ETL tool’s bulk-loading options instead, as they are tuned for processing large data volumes. Moreover, if you are using SSIS, don’t use the built-in Slowly Changing Dimension component for performance-sensitive applications: it performs row-by-row updates by default, which can grind everything to a halt. Take a look at the Multiple Hash component on CodePlex to calculate your own record hashes, and roll your own update logic using intermediate working tables.
- For sources coming from a database, make sure you are offloading table joins to the database. Chances are your database can do the join much more efficiently than your ETL tool can.
- Never use cursors in any ETL code. Cursors have their uses, but they should not appear in any regularly-scheduled ETL process; rewrite the code using set logic instead. I have seen people try to get creative with this rule and use a WHILE loop to iterate through each record in a dataset. No, that does not make it better.
- Check your ETL scheduling logic and make sure you are utilizing as many parallel threads as possible. For example, there is no good reason to run all your code lookup tables sequentially when none of them are dependent on each other. Removing unnecessary dependencies will help make sure all your CPU cores are staying busy.
- Make sure you have a step at the end of your jobs to rebuild indexes on your reporting tables. This will help make sure your indexes are in optimal shape not only for your reports, but also for any downstream ETL processes that hit those tables.
- Check that the Auto Update Statistics option on your target database is turned on. If your database does not support it, or you have it turned off for other reasons, make sure statistics are updated at the end of each ETL run. You don’t want queries to start crawling because the optimizer picks a bad plan from stale statistics when you can fix that pretty easily.
- Limit the number of CTEs and joins in a single query. For SQL Server, the query optimizer starts choosing sub-optimal query plans at around 8 tables. If you need that many joins, evaluate whether staging data in intermediate temp tables partway through makes more sense.
- If you are using stored procedures as part of your ETL processing, copy all input parameters into variables local to the stored procedure. SQL Server sometimes suffers from parameter sniffing, where it compiles a query execution plan using a parameter value that does not represent what your data actually looks like. Using local variables helps mitigate this.
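The incremental-extraction tip can be sketched in T-SQL. The table and column names here (`etl.WatermarkLog`, `src.Orders`, `ModifiedDate`) are hypothetical and assume the source keeps a reliable change timestamp:

```sql
-- Watermark pattern: pull only rows changed since the last successful load.
DECLARE @LastLoadTime datetime2;

SELECT @LastLoadTime = MAX(LastSuccessfulLoadTime)
FROM   etl.WatermarkLog          -- assumed control table maintained by the ETL job
WHERE  TableName = 'Orders';

SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
FROM   src.Orders
WHERE  ModifiedDate > @LastLoadTime;  -- only changed rows cross the wire
```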
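The left-outer-join alternative to a large cached lookup might look like the following sketch (the `stg.Transactions` and `dw.DimCustomer` names are assumptions for illustration):

```sql
-- Instead of caching a 50-million-row table in the ETL tool's memory,
-- let the database resolve the lookup with a left outer join.
SELECT s.TransactionID,
       s.CustomerKey,
       d.SurrogateKey           -- NULL when no match; map to an "unknown member" downstream
FROM   stg.Transactions AS s
LEFT OUTER JOIN dw.DimCustomer AS d
       ON d.CustomerKey = s.CustomerKey;
```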
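The record-hash approach to avoiding row-by-row SCD updates can be rolled by hand in T-SQL as well. This is a minimal sketch with hypothetical staging and dimension tables, using `HASHBYTES` in place of the Multiple Hash component:

```sql
-- Stamp each staged row with a hash of its tracked columns.
-- CONCAT treats NULLs as empty strings, so add explicit markers if that matters.
UPDATE s
SET    s.RowHash = HASHBYTES('SHA2_256',
           CONCAT(s.CustomerName, '|', s.City, '|', s.Segment))
FROM   stg.DimCustomer AS s;

-- Set-based update: only rows whose hash actually changed are touched.
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.City         = s.City,
       d.Segment      = s.Segment,
       d.RowHash      = s.RowHash
FROM   dw.DimCustomer AS d
JOIN   stg.DimCustomer AS s ON s.CustomerKey = d.CustomerKey
WHERE  s.RowHash <> d.RowHash;
```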
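To make the cursor rule concrete, here is the set-based shape that replaces a cursor or WHILE loop over individual records (table names are illustrative):

```sql
-- Cursor version would fetch each pending order and update it one row at a time.
-- Set-based rewrite: one statement updates every qualifying row at once.
UPDATE o
SET    o.Status = 'Shipped'
FROM   dbo.Orders AS o
JOIN   dbo.Shipments AS s
       ON s.OrderID = o.OrderID
WHERE  o.Status = 'Pending';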
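The post-load maintenance tips (index rebuilds and statistics updates) boil down to a short step at the end of the job, sketched here against a hypothetical `dw.FactSales` table:

```sql
-- Rebuild indexes on the reporting table; a rebuild also refreshes
-- that table's index statistics with a full scan.
ALTER INDEX ALL ON dw.FactSales REBUILD;

-- Refresh the remaining (column) statistics explicitly if
-- Auto Update Statistics is off or you want full-scan accuracy.
UPDATE STATISTICS dw.FactSales WITH FULLSCAN;
```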
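Staging partway through a many-join query might look like this sketch, which splits one wide join into two smaller ones via a temp table (all object names are assumptions):

```sql
-- Stage the first half of the join into a temp table...
SELECT o.OrderID, c.Region
INTO   #OrderCustomer
FROM   dbo.Orders AS o
JOIN   dw.DimCustomer AS c ON c.CustomerKey = o.CustomerKey;

-- ...then finish with a smaller, easier-to-optimize query.
SELECT oc.Region, SUM(li.Amount) AS TotalAmount
FROM   #OrderCustomer AS oc
JOIN   dbo.OrderLines AS li ON li.OrderID = oc.OrderID
GROUP BY oc.Region;
```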
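The local-variable defense against parameter sniffing follows a standard pattern; the procedure and tables below are hypothetical:

```sql
CREATE PROCEDURE etl.LoadSalesByRegion
    @RegionID int
AS
BEGIN
    -- Copy the parameter into a local variable so the plan is compiled
    -- against average column statistics rather than the sniffed value.
    DECLARE @LocalRegionID int = @RegionID;

    SELECT SUM(SalesAmount) AS TotalSales
    FROM   dw.FactSales
    WHERE  RegionID = @LocalRegionID;
END;
```

The trade-off is that the optimizer now uses density estimates for every execution, which is usually what you want for ETL workloads with skewed data.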
As with any other implementation, planning ahead of time will pay dividends down the line once everything is running in a production state. If you follow these suggestions, you will be well on your way to making sure your ETL processes are running as fast as possible.