Microsoft recently released the newest version of Azure Data Factory (which you will see referred to as V2 in Azure) to General Availability. This is a major release and a huge step up from the previous version of Azure Data Factory (read about the new features here). One of the most interesting new features is support for SQL Server Integration Services (SSIS) packages: you can now set up an SSIS runtime environment when you spin up Azure Data Factory.
As a technology consultant, I find the newest Azure Data Factory release to be one of the most intriguing Platform as a Service (PaaS) options available for data processing right now. You get all the functionality of a full-featured ETL tool without the overhead of managing a SQL Server install. If you tried the launch version of Azure Data Factory and walked away disappointed (like I did), give the new version a shot. Spinning up an SSIS instance in Azure Data Factory is easy to do, and easy to try out. You may be pleasantly surprised with this new release.
Why would I want to run SSIS in Azure Data Factory?
SSIS in Azure Data Factory offers most of the same features that you would get in a normal SSIS installation. In addition, you get some of the same networking features you would expect from a VM.
Some of the main highlights are the following:
- Full SSIS Code Compatibility: You can easily jump into the cloud with a tool (SSIS) that you or your staff may already be familiar with. All the power you get from developing packages in Visual Studio is still available to you. Moreover, you can leverage existing SSIS packages previously developed for on-premise use. In most cases, you can deploy the same code to the SSIS runtime with no code changes, or only minor ones.
- Project Deployments: If you are familiar with the Project Deployment model, you will be right at home using SSIS on Azure Data Factory. You can use the same SSIS deployment wizard that you use for on-premise deployments; simply deploy to the Azure SQL Database server that houses SSISDB.
- Full Access to the SSIS Catalog: SSISDB is exposed as an Azure SQL Database, allowing you to connect to it the same way as you would with an on-premise installation. You can also continue to use your existing library of SSISDB scripts and queries.
- Built-in Execution Reports: All the normal execution reports you are familiar with in SSIS have made it over in this release.
- Support for Virtual Networks: You can configure the SSIS runtime to join a virtual network (much like you can with normal virtual machines). This allows you to access on-premise resources, or other resources without public endpoints, from the SSIS runtime.
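As a concrete example of that catalog access, because SSISDB is a regular Azure SQL Database you can query the catalog views directly from PowerShell. The sketch below (the server name, login, and password are placeholders I made up; it assumes the SqlServer module is installed) pulls the most recent package executions:

```powershell
# Query the ten most recent package executions from the SSIS catalog.
# The server name and credentials are placeholders -- substitute the
# Azure SQL Database server that hosts your SSISDB.
Invoke-Sqlcmd -ServerInstance "myssisserver.database.windows.net" `
    -Database "SSISDB" -Username "ssisadmin" -Password "<mypassword>" `
    -Query "SELECT TOP 10 execution_id, folder_name, project_name,
                   package_name, status, start_time, end_time
            FROM catalog.executions
            ORDER BY start_time DESC"
```

Any script you already run against an on-premise SSISDB should work the same way once it points at the Azure SQL Database server.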
What does a complete implementation of SSIS in Azure Data Factory look like?
As a Microsoft Azure PaaS offering, the SSIS runtime in Azure Data Factory works best when accessing other Azure resources. However, it can also reach your on-premise network through a site-to-site VPN or an ExpressRoute circuit to Azure. The diagram above shows how the SSIS runtime communicates with both Azure and on-premise resources.
To set up SSIS on Azure Data Factory, you will need the following components:
- An instance of Azure Data Factory (V2)
- A blob storage account: Used for SSIS startup logging and configuration scripts
- An Azure SQL Database server or Managed Instance: Used to store the SSISDB database
- An SSIS runtime: This is the runtime that SSIS packages run under
- A virtual network (optional): Needed if you want to access private Azure resources without opening up public endpoints
- A VPN gateway or ExpressRoute circuit (optional): Needed if you want to access corporate / on-premise network resources
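As a rough sketch of how these pieces fit together, the Azure PowerShell snippet below creates and starts an SSIS runtime against an existing Data Factory and Azure SQL Database server. All resource names, the region, and the node size are placeholders; check the current Az.DataFactory documentation for the exact parameters available:

```powershell
# Create an Azure-SSIS integration runtime in an existing Data Factory (V2).
# Resource names, region, and node size below are illustrative placeholders.
$catalogCred = Get-Credential   # admin login for the Azure SQL Database server

Set-AzDataFactoryV2IntegrationRuntime `
    -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryV2" `
    -Name "MySsisRuntime" `
    -Type Managed `
    -Location "EastUS" `
    -NodeSize "Standard_D2_v3" `
    -NodeCount 2 `
    -Edition "Standard" `
    -CatalogServerEndpoint "myssisserver.database.windows.net" `
    -CatalogAdminCredential $catalogCred `
    -CatalogPricingTier "Basic"

# Starting the runtime typically takes 20-30 minutes.
Start-AzDataFactoryV2IntegrationRuntime `
    -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryV2" `
    -Name "MySsisRuntime" -Force
```

The same setup can also be done entirely through the Azure portal wizard if you prefer a guided experience.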
Sounds good – anything else I need to know?
Does this all sound good, and do you want to try it out for yourself? Great! However, there are some differences to consider compared to a traditional on-premise deployment. Keep the following tips in mind as you get started:
- Pausing the SSIS Instance: You might consider pausing the SSIS runtime during periods of inactivity to save on Azure spend. However, you will need to wait approximately 30 minutes for the runtime to become available after restarting it. Make sure to account for this delay in your SSIS runtime pause / restart strategy.
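A pause / restart cycle like this is easy to script. For example, with the Az PowerShell module (the resource names below are placeholders):

```powershell
# Pause the Azure-SSIS runtime to stop accruing compute charges.
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryV2" -Name "MySsisRuntime" -Force

# Restart it before the next package run -- expect roughly a
# 30-minute wait before the runtime is available again.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryV2" -Name "MySsisRuntime" -Force
```

You could schedule these commands (for example, with Azure Automation) to pause the runtime overnight or on weekends.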
- Editions and Licensing: If you have Azure benefits with your existing SQL Server licenses, you may be able to save money by using your own license. Check that your license allows for this before turning on this option. If you are undecided between using a Standard or Enterprise instance, see here for a list of additional features provided by the SSIS runtime when using Enterprise edition.
- On-premise Data Access: If you have on-premise data sources that you need the SSIS runtime to access, you will need to set up a site-to-site VPN or an ExpressRoute circuit to access your on-premise network. Join your SSIS runtime instance to a virtual network to take advantage of this. For testing purposes, you can set up a point-to-site VPN in a pinch to test access to specific on-premise sources. Note that this requires each needed on-premise resource to have a VPN client installed, and resources may change IP addresses. Because of this, I do not recommend using point-to-site VPNs as a permanent solution here.
- Runtime Sizing: The SSIS runtime sizing options are provided here. Start out by sizing this instance the same way you would for a dedicated SSIS box.
- SSIS Catalog Sizing: The SSIS catalog will live on an Azure SQL Database. Since SSISDB is usually a low-traffic database, I would recommend starting out with Basic sizing. You can always easily scale this database up if you run into performance issues.
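Scaling SSISDB up later is a one-line operation. For example, with the Az PowerShell module (server and tier names are placeholders):

```powershell
# Scale SSISDB from Basic up to Standard S1 if performance becomes an issue.
Set-AzSqlDatabase -ResourceGroupName "MyResourceGroup" `
    -ServerName "myssisserver" -DatabaseName "SSISDB" `
    -Edition "Standard" -RequestedServiceObjectiveName "S1"
```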
- Package Versions: As of the time of this writing, the SSIS Runtime is compatible with packages developed for SQL Server 2017. To avoid possible issues, make sure you upgrade packages developed on earlier versions to SQL Server 2017 before deploying.
- Project Deployment Model Only: The SSIS runtime only supports the Project Deployment Model that has been supported in SSIS since SQL Server 2012. If you are using packages built on the older Package Deployment Model, update them to the Project Deployment model first.
- Package Scheduling: Unless you are using a Managed SQL instance (currently in preview) for SSISDB, you will not have SQL Server Agent available as an option for scheduling. While you can use an existing on-premise SQL Server Agent instance to schedule packages on Azure Data Factory, I would recommend using Azure Data Factory pipelines to kick off your packages. (While scheduling packages directly in Management Studio creates Data Factory pipelines as well, creating pipelines and triggers yourself gives you control over naming conventions).
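Once you have a pipeline that wraps an Execute SSIS Package activity, kicking off a package run from PowerShell is straightforward. In this sketch, the factory and pipeline names are placeholders:

```powershell
# Trigger a one-off run of a Data Factory pipeline that wraps an
# Execute SSIS Package activity, then check on its status.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryV2" -PipelineName "RunMySsisPackage"

Get-AzDataFactoryV2PipelineRun -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactoryV2" -PipelineRunId $runId
```

For recurring runs, attach a schedule trigger to the pipeline instead of invoking it manually.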
- Credentials for File Shares: Credentials for file shares are configured in a custom setup script used to spin up the SSIS instance. Below is an example main.cmd snippet that saves credentials for an Azure file share. If you have additional Windows or Azure file shares to configure, add more CMDKEY calls to the same script.
@ECHO OFF
REM Change the following items below:
REM /add: Change to FQDN of storage account
REM /user: Change to name of storage account with AZURE as the domain (ex: if storage account is named mystorage, user is AZURE\mystorage)
REM /pass: Change to access key for storage account.
ECHO Adding credentials for Azure file storage account
CMDKEY /add:myfilestorage.file.core.windows.net /user:"AZURE\myfilestorage" /pass:"<myaccesskey>"
- Installing Custom Components: If you have custom components you are using in your SSIS packages, you will need to include them as part of the custom setup script that is called when spinning up the runtime. For each component, you will need an installer that can run in silent mode (without any interactive inputs). If you leverage any paid components, make sure that they support SSIS in Azure Data Factory first. License checks that bind to a specific server may have issues with the dynamic node allocation scheme used to spin up the SSIS runtime.
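Such an install typically amounts to one more line in the same main.cmd shown above. For instance, a silent MSI install might look like the following (the component file name is a placeholder; stage the installer in the blob container that holds your custom setup script):

```bat
REM Install a custom SSIS component silently (no interactive prompts).
REM MyCustomComponent.msi is a placeholder for your component's installer.
ECHO Installing custom component
msiexec /i "MyCustomComponent.msi" /qn /norestart
```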
Moving your BI infrastructure to the (Azure) cloud
Microsoft has had great options for cloud data storage with Azure SQL Database, Azure SQL Data Warehouse, and Azure Analysis Services for some time. In addition, Power BI has been available as a cloud-based solution for dashboards and reporting. Until recently, however, the Azure stack has lacked traditional ETL capabilities. The initial version of Azure Data Factory was severely limited in functionality and maintainability, and SSIS (or an equivalent ETL tool) running in a VM was the only viable option if you needed a fully-featured ETL tool. The newest release makes significant strides by bringing the power and functionality of SSIS into Azure Data Factory. While setting up Azure Data Factory differs in some ways from a traditional SSIS deployment, having the full power of SSIS available without managing infrastructure is a unique offering. Since SSIS is a mature tool, the pool of people with SSIS skills is sizeable, and code compatibility with traditional SSIS installations vastly lowers the barrier to entry and eliminates most of the training needed to get up and running with development.
Hopefully this post has been a good introduction for you to the newest release of Azure Data Factory, and I have been able to pique your interest in the offering. If you have been wanting to move your BI infrastructure to the cloud, take a look at the newest Azure Data Factory release. You may find that it fits the bill for your ETL engine in the cloud.