Tips on supercharging your ETL

We live in an era of big data, with data volumes growing at an exponential pace. At the same time, data-driven organizations increasingly demand real-time data availability. Given these trends, ETL performance is quickly gaining business importance. Let’s look at a few approaches you can consider to maximize the performance of data processing.

Before jumping into alternatives, let’s review a typical design. Typical extract, transform and load (ETL) processes are built as a script with top-down orchestration: a top-level job/package orchestrates intermediate-level packages, which in turn drive the components that actually perform the data manipulation, as presented in the diagram below:

[Diagram: ETL 1 – top-down orchestration of jobs, packages and components]

The processing is often sequential – the next component does not start until the previous component completes execution. These sequential dependencies are sometimes warranted by logical processing requirements, but frequently they are the result of poor ETL design. In both cases the outcome is poor ETL performance, because data processing is effectively single-threaded.
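As a rough sketch in Python (hypothetical job names, with a sleep standing in for real work), the typical top-down flow boils down to a chain of blocking calls:

```python
import time

def run_job(name):
    """Stand-in for a real ETL job/package; sleeps to simulate work."""
    print(f"running {name}")
    time.sleep(1)

# Top-down, sequential orchestration: each job blocks until the previous one
# finishes, so total runtime is the sum of all job durations, regardless of
# how many CPU cores the ETL server has.
for job in ["land_source", "stage_source", "load_dimensions", "load_facts", "process_cubes"]:
    run_job(job)
```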

Now, let’s look at some alternative approaches:

1)    Remove redundant dependencies

Remove unneeded dependencies and execute as many jobs/packages as possible concurrently. To achieve this, first identify the logical dependencies in your process and assign each job or package to a dependency group. Aim to minimize the number of dependency groups. Here is a quick example:

  1. Group 1 – land, stage and transform source data
  2. Group 2 – process dimensions
  3. Group 3 – process facts
  4. Group 4 – process cubes

Within each dependency group, all jobs should be executed concurrently; for example, all dimensions will be processed in parallel (assuming a star schema with no dependencies between dimensions).
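Here is a minimal sketch of that pattern in Python (hypothetical job names; a sleep stands in for real work): groups run in order, but the jobs inside each group run concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_job(name):
    """Stand-in for a real ETL job/package; sleeps to simulate work."""
    print(f"running {name}")
    time.sleep(1)
    return name

# Groups execute in order; jobs within a group execute concurrently.
dependency_groups = [
    ["stage_orders", "stage_customers", "stage_products"],  # Group 1: land/stage/transform, one job per source
    ["dim_customer", "dim_product", "dim_date"],            # Group 2: dimensions
    ["fact_sales", "fact_inventory"],                       # Group 3: facts
    ["process_cubes"],                                      # Group 4: cubes
]

for group in dependency_groups:
    with ThreadPoolExecutor(max_workers=len(group)) as pool:
        # Leaving the `with` block waits for every job in the group
        # to finish before the next group starts.
        list(pool.map(run_job, group))
```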

Advantages / Disadvantages:

  • Enabling jobs to execute in parallel shortens load time
  • Performance can be further improved by scaling up ETL server resources (CPU/RAM)
  • Full dependency analysis is required to avoid conflicts between jobs

2)   Identify and fix laggards

Previously, you split processing into a minimal number of logical dependency groups and enabled concurrent execution, but the process still runs slowly. What’s next? First, find the worst-performing components by analyzing the ETL execution logs (a small log-analysis sketch follows the table below). Then, identify the performance bottleneck for each slow job/package:

Performance bottleneck – how to address it:

Source (file, database, API call)
  • Pre-stage data to make it available in time for the next ETL run
  • Pull source data by running multiple, concurrent packages

Destination (database)
  • Check the structure of destination tables: the fastest destination is a heap table with no indexes
  • Partition the table and bulk load into empty partitions

Transformations requiring intensive RAM/CPU usage
  • Scale out your processing by moving resource-intensive jobs to dedicated servers
  • Run transformations as multiple multi-threaded processes
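For the first step – finding the laggards – a simple pass over the execution log is often enough. The sketch below assumes a hypothetical CSV log with package, start_time and end_time columns; adapt it to whatever your ETL tool actually records.

```python
import csv
from datetime import datetime

def slowest_packages(log_path, top_n=5):
    """Rank packages by duration, given a CSV execution log with columns:
    package, start_time, end_time (ISO-8601 timestamps)."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    durations = []
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            start = datetime.strptime(row["start_time"], fmt)
            end = datetime.strptime(row["end_time"], fmt)
            durations.append((row["package"], (end - start).total_seconds()))
    return sorted(durations, key=lambda r: r[1], reverse=True)[:top_n]

# Print the five worst performers from last night's run.
for package, seconds in slowest_packages("etl_run_log.csv"):
    print(f"{package}: {seconds:.0f}s")
```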

 

Advantages / Disadvantages:

  • Because all data processing tasks are already multi-threaded, optimizing the worst-performing jobs directly reduces overall processing time
  • Substantial performance benefits can be realized
  • Many of the approaches recommended above require intensive code redesign

3) Smart ETL jobs and queuing

You’ve optimized your code and made good performance improvements, but there are still a couple of jobs that run too long. How can you process the data even faster? Consider a “smart” ETL architecture built on a unit-of-work concept and queue-based processing:

[Diagram: ETL 2 – identical job instances pulling units of work from a queue]

In this architecture, each job/package is not orchestrated top-down. Instead, it runs independently in multiple identical instances, potentially on multiple ETL servers if you need to scale out processing. Each instance continuously requests a unit of work (UW) from the queue, receives it and processes it.

What can represent a unit of work? Any independent data processing operation. For large fact table loads, which are typically the most time-consuming ETL operations, a UW can represent a specific data subset (e.g. 1M rows) that is processed and loaded into an empty partition. Similarly, for compute-intensive processing it can be a subset of the input rows. When you have a large number of sources, a UW can represent each source query. It is important to ensure that the selected units of work are independent and don’t interfere with each other’s processing. Selecting the optimal UW size also matters, to minimize queue/messaging overhead (think of each UW as a data set, not an individual row).
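As a rough illustration, here is a minimal Python sketch of the pattern. An in-memory queue.Queue stands in for whatever queueing/messaging system you actually use, and a hypothetical load_partition function stands in for the real work:

```python
import queue
import threading

work_queue = queue.Queue()

def load_partition(unit):
    """Stand-in for the real work: process one independent data subset
    (e.g. ~1M rows) and bulk load it into its own empty partition."""
    print(f"{threading.current_thread().name} processing rows "
          f"{unit['first_row']}..{unit['last_row']}")

def worker():
    # Each identical instance keeps pulling units of work until the queue is drained.
    while True:
        try:
            unit = work_queue.get(timeout=1)
        except queue.Empty:
            return
        load_partition(unit)
        work_queue.task_done()

# Enqueue the units of work: 10M rows split into 1M-row chunks.
for first in range(0, 10_000_000, 1_000_000):
    work_queue.put({"first_row": first, "last_row": first + 999_999})

# Start several identical worker instances; add more workers (or more servers) to scale out.
threads = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```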

Advantages / Disadvantages:

  • Allows optimal utilization of ETL server resources
  • Supports scaling out processing by adding additional ETL servers
  • Improved ability to isolate and handle errors: it is a lot easier to isolate, correct and re-run a specific UW than to re-run the whole ETL orchestration process
  • The downside is the increased code complexity: each package needs to be “smart” so it can independently request UWs, process them and handle the errors.

Optimizing ETL performance can be a daunting task. Attacking it brute-force style by adding more hardware often does not deliver the expected result because of the single-threaded architecture outlined above. I hope the steps outlined here give you some practical thinking points in your ETL optimization pursuit and help you achieve your goals.

Are you experiencing ETL performance issues, planning a refactoring effort, or designing a new ETL solution? Let’s talk.
