Clients frequently ask us to help with analytics data platform selection. At the Ignite 2015 Conference in Chicago earlier this month Microsoft summarized its current data platform technology options:
Above image sourced from here.
This blog post will focus on the storage layer of the Microsoft data platform (highlighted above) and provide guidance on optimal technology selection. There are 4 options available within platform that are suitable for medium to large scale analytics platform.
Key illustrative characteristics for each technology are summarized in the table below:
|Max Scale, TB||
|Features and functionality||Cost|
|SQL Data Warehouse||SQL DW||Relational||
What factors should be considered for data platform decision?
I selected 4 that tend to be most impactful in my opinion:
This is the key decision factor. It is measured as a set of 2 numbers:
- Total volume of data the platform is expected to handle. This is important to understand impact on query performance as your data set grows
- Maximum incremental (e.g. daily) load volume. This impacts platform loading performance
Estimated ranges are expressed as [Total Volume]/[Incremental Volume]. The actual numbers are just the guidelines, they are constantly increasing with evolution of hardware
|10/0.5||Up to 10TB total volume and/or up to 500GB incremental load|
|50/2.5||Approximately 50TB total volume and/or up to 2.5TB incremental load|
|100/5||Over a 100TB total volume and/or over 5 TB incremental volume|
Data structure illustrates types of data that your platform is expected to handle. This will be important to determine best match to each platform processing capabilities
|Unstructured||Blobs, files, text, images, binary data|
|Semi-structured||Combination of relational data, documents structured as JSON/XML and unstructured content (text)|
|Structured||Normalized relational tables, structured documents (JSON,XML) with well-known schemas|
This is broader category that aims to measure how complex are your data manipulations in the platform.
|Simple||Straight copy, basic lookups, joins, aggregations|
|Complex||Parsing of complex documents, text analysis, image processing, complex multi-factor de-duplication|
This category describes expected general latency of your system. Does data need to be available immediately? What are SLAs on the query/report response time?
|Batch||Loading frequency ranges from traditional once-a-day loading to hourly mini-batches. Query performance SLAs support batch report generation (over 30 seconds)|
|Real-time||Loading frequency ranges from real-time to micro-batch (minutes). Query performance SLAs support interactive analysis and data exploration (sub-seconds)|
A good understanding of all use cases and requirements for your data platform and its future evolution will be required to accurately rate platform of the 4 categories above. Obviously those ratings are not 100% scientific and may be a little subjective but it should give you a good idea to narrow down you selections. How do we translate these categories into platform recommendation?
Data volume is by far the most impactful factor to help in your platform selection. Let’s look at 3 outlined scenarios.
Scenario 1: Total Data Volume under 10TB / Incremental Data Volume under 500GB
In this scenario SQL Server platform typically provides best cost/performance ratio for use cases. Having mature and proven platform helps decrease complexity, obtain development resources, and minimize support and maintenance costs.
In extreme cases of data variety HDInsight cluster may be a better option given its native ability to handle unstructured data.
With extremely complex processing requiring near-real time latency Azure SQL Data Warehouse (SQL DW) or Analytics Platform System (APS) may be considered
Scenario 2: Total Data Volume 10-100TB / Incremental Data Volume under 0.5-5TB
In this data volume range, SQL Server can still handle simple, structured, batch-oriented loads but its performance and effectiveness diminishes for less structured, low-latency scenarios.
HDInsight cluster should be considered for unstructured data storage and processing.
For complex processing and low latency SQL DW or APS will deliver best performance. As seen from the graph, there is an overlap between HDInsight and SQL DW – it indicates that either technology could be applied or both can be combined to achieve desired result. SQL DW and APS both provide seamless integration with HDInsight.
Scenario 3: Total Data Volume over 100TB / Incremental Data Volume over 5TB
This is now true Big Data territory. While properly sized SQL Server platform can still handle some basic loads, its effectiveness is severity diminished while costs are very substantial.
HDInsight cluster will easily scale out into multi-petabyte territory to handle unstructured data loads.
SQL DW delivers very good performance/cost ratio in this group due to its ability to scale compute and storage components separately. You only pay for processing power when it’s needed. SQL DW is expected to scale into 1PB area.
APS is the best performer in this group – it promises linear scalability into 5PB range while seamlessly integrating with cloud HDInsight or on premise Hadoop platform. As expected, it also carries the highest cost.
An overlap between HDInsight, SQL DW and APS highlights the fact that you may need to combine relational and non-relational technologies to address processing needs for highly unstructured data.
While there is no quick recipe when choosing your future data platform, I hope that provided categories and scenarios should give you a good start and point you in the right direction in your planning process.