Tips on Selecting a Microsoft Analytics Data Platform

Clients frequently ask us to help with analytics data platform selection. At the Ignite 2015 conference in Chicago earlier this month, Microsoft summarized its current data platform technology options:

Above image sourced from here

This blog post will focus on the storage layer of the Microsoft data platform (highlighted above) and provide guidance on optimal technology selection. There are four options within the platform that are suitable for a medium- to large-scale analytics platform.

Key illustrative characteristics for each technology are summarized in the table below:

| Technology | Acronym | Type | Max Scale, TB | Product maturity | Features and functionality | Cost |
|---|---|---|---|---|---|---|
| SQL Server | | Relational | ~100 | Mature, proven | | Best cost/performance at low volumes |
| Analytics Platform System | APS | Relational | ~5,000 | | | Highest |
| SQL Data Warehouse | SQL DW | Relational | ~1,000 | | | Pay only for compute when needed |
| HDInsight | | Non-relational (Hadoop) | Multi-petabyte | | | |

What factors should be considered for the data platform decision?

I selected the four that, in my opinion, tend to be the most impactful:

Data Volume

This is the key decision factor. It is measured as a pair of numbers:

  1. Total volume of data the platform is expected to handle. This is important for understanding the impact on query performance as your data set grows.
  2. Maximum incremental (e.g. daily) load volume. This impacts platform loading performance.

Estimated ranges are expressed as [Total Volume]/[Incremental Volume]. The actual numbers are just guidelines; they constantly increase as hardware evolves:

10/0.5: Up to 10TB total volume and/or up to 500GB incremental load
50/2.5: Approximately 50TB total volume and/or up to 2.5TB incremental load
100/5: Over 100TB total volume and/or over 5TB incremental load
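As an illustration, the volume tiers above can be encoded as a simple classifier. This is only a sketch: the function name is my own, and the thresholds are the guideline numbers from this post, not hard limits.

```python
def volume_tier(total_tb: float, incremental_tb: float) -> str:
    """Map total and incremental data volumes (in TB) to one of the
    three guideline tiers. Thresholds follow the scenario boundaries
    described in this post and are guidelines, not hard limits."""
    if total_tb > 100 or incremental_tb > 5:
        return "100/5"    # over 100TB total and/or over 5TB incremental
    if total_tb > 10 or incremental_tb > 0.5:
        return "50/2.5"   # roughly 10-100TB total and/or 0.5-5TB incremental
    return "10/0.5"       # up to 10TB total and/or up to 500GB incremental
```

For example, a platform expected to hold 50TB with a 2TB daily load falls into the middle tier.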

Data Structure

Data structure describes the types of data your platform is expected to handle. This is important for determining the best match to each platform's processing capabilities:

Unstructured: Blobs, files, text, images, binary data
Semi-structured: Combination of relational data, documents structured as JSON/XML, and unstructured content (text)
Structured: Normalized relational tables, structured documents (JSON, XML) with well-known schemas

Processing Complexity

This broader category aims to measure how complex your data manipulations in the platform are:

Simple: Straight copy, basic lookups, joins, aggregations
Complex: Parsing of complex documents, text analysis, image processing, complex multi-factor de-duplication

Data Latency

This category describes expected general latency of your system. Does data need to be available immediately? What are SLAs on the query/report response time?

Batch: Loading frequency ranges from traditional once-a-day loading to hourly mini-batches. Query performance SLAs support batch report generation (over 30 seconds)
Real-time: Loading frequency ranges from real-time to micro-batch (minutes). Query performance SLAs support interactive analysis and data exploration (sub-seconds)

A good understanding of all use cases and requirements for your data platform, and of its future evolution, is required to rate a platform accurately against the four categories above. Obviously these ratings are not 100% scientific and may be a little subjective, but they should give you a good idea and narrow down your selection. How do we translate these categories into a platform recommendation?

Data volume is by far the most impactful factor in your platform selection. Let's look at the three outlined scenarios.

Scenario 1: Total Data Volume under 10TB / Incremental Data Volume under 500GB

In this scenario the SQL Server platform typically provides the best cost/performance ratio. Having a mature and proven platform helps decrease complexity, obtain development resources, and minimize support and maintenance costs.

In extreme cases of data variety, an HDInsight cluster may be a better option given its native ability to handle unstructured data.

With extremely complex processing requiring near-real-time latency, Azure SQL Data Warehouse (SQL DW) or the Analytics Platform System (APS) may be considered.

Scenario 2: Total Data Volume 10-100TB / Incremental Data Volume 0.5-5TB

In this data volume range, SQL Server can still handle simple, structured, batch-oriented loads, but its performance and effectiveness diminish for less structured, low-latency scenarios.

An HDInsight cluster should be considered for unstructured data storage and processing.

For complex processing and low latency, SQL DW or APS will deliver the best performance. As seen from the graph, there is an overlap between HDInsight and SQL DW: either technology could be applied, or both can be combined to achieve the desired result. SQL DW and APS both provide seamless integration with HDInsight.

Scenario 3: Total Data Volume over 100TB / Incremental Data Volume over 5TB

This is now true Big Data territory. While a properly sized SQL Server platform can still handle some basic loads, its effectiveness is severely diminished while costs become very substantial.

An HDInsight cluster will easily scale out into multi-petabyte territory to handle unstructured data loads.

SQL DW delivers a very good performance/cost ratio in this group due to its ability to scale compute and storage components separately. You only pay for processing power when it's needed. SQL DW is expected to scale into the 1PB range.

APS is the best performer in this group: it promises linear scalability into the 5PB range while integrating seamlessly with cloud HDInsight or an on-premises Hadoop platform. As expected, it also carries the highest cost.

The overlap between HDInsight, SQL DW, and APS highlights the fact that you may need to combine relational and non-relational technologies to address the processing needs of highly unstructured data.
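Pulling the three scenarios together, the selection logic can be sketched roughly as follows. This is my own simplification of the guidance above; the function and factor names are illustrative, not a formal methodology, and real selections should weigh all four categories case by case.

```python
def recommend_platforms(total_tb, incremental_tb, structure, complexity, latency):
    """Rough sketch of the three-scenario guidance in this post.
    structure:  "structured" | "semi-structured" | "unstructured"
    complexity: "simple" | "complex"
    latency:    "batch" | "real-time"
    Returns a list of candidate platforms; multiple entries mean
    the technologies overlap and can be combined."""
    candidates = []
    if total_tb <= 10 and incremental_tb <= 0.5:
        # Scenario 1: SQL Server typically gives the best cost/performance
        candidates.append("SQL Server")
        if structure == "unstructured":
            candidates.append("HDInsight")
        if complexity == "complex" and latency == "real-time":
            candidates += ["SQL DW", "APS"]
    elif total_tb <= 100 and incremental_tb <= 5:
        # Scenario 2: SQL Server only for simple, structured, batch loads
        if (structure, complexity, latency) == ("structured", "simple", "batch"):
            candidates.append("SQL Server")
        if structure == "unstructured":
            candidates.append("HDInsight")
        if complexity == "complex" or latency == "real-time":
            candidates += ["SQL DW", "APS"]
    else:
        # Scenario 3: true Big Data territory
        if structure == "unstructured":
            candidates.append("HDInsight")
        candidates += ["SQL DW", "APS"]
    return candidates
```

For example, a 50TB platform with unstructured data and simple batch processing would point to HDInsight alone, while anything over 100TB lands on SQL DW and APS (plus HDInsight for unstructured data).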


While there is no quick recipe for choosing your future data platform, I hope the categories and scenarios provided give you a good start and point you in the right direction in your planning process.
