The term Big Data is getting a lot of attention in technology circles lately, but there is still a lack of clarity over what Big Data really means. Whenever I hear groups of people discuss the topic, they are often talking about vastly different scales of problems. Often one person’s notion of big data is completely different from, and even incompatible with, someone else’s. At a recent presentation on the topic, an engineer from Microsoft mentioned that big data problems are really time and space complexity issues.
This will take any computer scientist back to their analysis of algorithms (or similar) class and the concept of Big O notation. I have fond memories of CS 321, but if you don’t, it’s enough to say that computer algorithms are rated according to the resources necessary to execute them. Because algorithms can theoretically work with an arbitrary amount of data, this is a process of relating the size of the input dataset to either the space (memory / storage) or the steps (time) required to solve a problem.
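To make the idea concrete, here is a minimal sketch in Python that counts the steps two simple routines take as the input grows. The function names and the step counters are my own illustration, not part of any library.

```python
def sum_values(data):
    """Add up every element once: O(n) time, O(1) extra space."""
    steps = 0
    total = 0
    for x in data:
        total += x
        steps += 1  # one step per element
    return total, steps


def count_out_of_order_pairs(data):
    """Compare every pair of elements: O(n^2) time, O(1) extra space."""
    steps = 0
    out_of_order = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            steps += 1  # one step per pair
            if data[i] > data[j]:
                out_of_order += 1
    return out_of_order, steps
```

With 10 elements the first routine takes 10 steps and the second takes 45. With 1,000 elements the first takes 1,000 steps and the second takes 499,500, which is why the growth rate matters more than the speed of any single step.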
Big data is generally split along these same two dividers, with storage and processing power representing space and time. In the past, the solution to both sets of problems was grid or cluster computing: using many systems to process one large problem. This is starting to change, though. The big data ecosystem is segmenting into problems with large amounts of data and problems with large amounts of processing. The two extremes are increasingly served by specialized approaches, such as Hadoop on the quantity end and general-purpose computing on graphics processing units (GPGPU) on the processing end.
Memory-bound calculations are calculations involving large sets of data, such as finding the average over trillions of data points. This isn’t a complex calculation, but it requires a large amount of memory, often just to hold the data. Hadoop is a great solution for extremely large data sets because it sends the program to the data (rather than pulling the data to the program). Hadoop scales very well on inexpensive hardware and is ideally suited to problems involving large sets of data.
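The averaging example can be sketched in the map and reduce style that Hadoop popularized. This is plain Python, not the Hadoop API, and the function names are hypothetical. Each partition stands in for a block of data stored on a separate node; the point is that only a small (sum, count) pair leaves each node, never the raw data.

```python
from functools import reduce


def summarize_partition(values):
    """Map step: runs where the data lives, returns a tiny (sum, count) pair."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total, count


def combine(a, b):
    """Reduce step: merge two partial (sum, count) pairs."""
    return a[0] + b[0], a[1] + b[1]


def distributed_mean(partitions):
    """Average over all partitions without gathering the values in one place."""
    partials = (summarize_partition(p) for p in partitions)
    total, count = reduce(combine, partials, (0.0, 0))
    if count == 0:
        raise ValueError("no data points")
    return total / count
```

Calling `distributed_mean([[1, 2, 3], [4, 5], [6]])` gives 3.5. Averaging the per-partition averages would give the wrong answer when partitions differ in size, which is why each partition reports its sum and count instead.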
Computationally bound calculations are those that involve complex operations (or a great many operations), normally over smaller sets of data. These are problem sets that are limited not by the amount of data, but by the computational complexity of the solution or algorithm. Matrix multiplication is a classic example of this problem. Weather simulation and bioinformatics are other problem sets limited not by the data, but by the ability to carry out many complex calculations. GPU solutions are increasingly being used to address these types of problems. The latest NVIDIA Kepler GPUs can perform 4.5 trillion floating point operations per second, and they can be chained together.
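To see why matrix multiplication is limited by computation rather than data, here is a sketch of the textbook triple-loop algorithm in plain Python, with a counter for the multiply-add operations. The counter is my addition for illustration; real GPU code would use a tuned library rather than loops like these.

```python
def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix, counting multiply-adds."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    result = [[0.0] * p for _ in range(n)]
    multiply_adds = 0
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += a[i][k] * b[k][j]
                multiply_adds += 1
            result[i][j] = acc
    return result, multiply_adds
```

For two n x n matrices this performs n^3 multiply-adds on only 2n^2 input values. Doubling n makes the data four times larger but the work eight times larger, so the arithmetic outgrows the data. Each output cell is also independent of the others, which is what makes the problem a natural fit for the thousands of parallel cores on a GPU.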
As big data moves further into the mainstream, each of these types of solutions will continue to expand and mature.