Edward Snowden may be on holiday in Russia right now, but the disclosures he left behind are still making waves around the world. I recently wrote about how the NSA surveillance program we now know as PRISM is actually a profound validation of the concepts and technologies of Big Data. There clearly is value buried deep within the massive volumes of data generated by our modern society. As America comes to grips with the impact and implications of this data and the existence of programs like PRISM, we are now reassessing exactly how we want our government to behave with such sensitive information.
One of the biggest pushes on this front is to limit how much data the government collects itself. One of the leading ideas is to keep the information stored on the servers and networks of the telecommunications and internet companies that generate and facilitate this data; that is, on private servers rather than government servers. This represents a huge shift from the sprawling data-ingestion process currently in place. Legal, ethical, and constitutional arguments aside, this presents an interesting new step in the evolution of Big Data technologies: the further distribution and decoupling of the data from anything like a centralized data store.
Indeed, the dominant Big Data platform, Hadoop, has already taken a large step away from the preceding model: the centralized relational database. Whereas a relational database stores everything together and pulls the data to the processors to query and analyze it, Hadoop sends the query to the data, where it resides across a massive network of servers and equipment. The next step takes this to an even higher level of abstraction and distribution: sending queries out from government computers to run across the private networks of telecommunications and internet companies.
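To make the contrast concrete, here is a minimal sketch of the "send the query to the data" idea in Python. The node names, records, and the `federated_query` helper are all hypothetical illustrations, not any real agency or vendor API; the point is simply that each data holder runs the query locally and only small result sets ever leave its network.

```python
def make_node(records):
    """A data holder: keeps its records private, runs queries locally,
    and returns only the matching results."""
    def run(query):
        return [r for r in records if query(r)]
    return run

# Hypothetical private data sets, each staying on its own "network"
# (simulated here as closures over local lists).
nodes = {
    "telecom_a": make_node([{"id": 1, "kind": "call"},
                            {"id": 2, "kind": "sms"}]),
    "isp_b":     make_node([{"id": 3, "kind": "email"},
                            {"id": 4, "kind": "call"}]),
}

def federated_query(nodes, query):
    """Fan the query out to every node and aggregate only the results.
    The raw data never moves to a central store."""
    return {name: run(query) for name, run in nodes.items()}

# Example: find all call records across both private networks.
results = federated_query(nodes, lambda r: r["kind"] == "call")
print(results)
```

Running this prints only the matching call records from each node; the centralized-database alternative would instead copy every record into one place before filtering, which is exactly the ingestion model the article argues we are moving away from.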
Even the NSA seems supportive of limiting its collection of data and keeping much of it on multiple private corporate networks. For one, this would save the agency a tremendous amount of money in storing, organizing, and maintaining the information; for another, it would fall more into line with the legal and cultural expectations of our society.
This concept isn’t really that new. Some very smart people at some of the largest technology companies in the world are already working on this model of distributed heterogeneous query (check out this research paper: http://academic.research.microsoft.com/Paper/1795770.aspx). Microsoft is not the only company looking at this; open source projects such as Apache Hadoop, and companies built around them like Hortonworks, are also working to bring this capability closer to reality. This move, which may be forced by legislation in the case of the NSA, will further accelerate the shift away from the ETL model of “data warehousing” and into the on-demand distributed query era.
This shift will likely not be fast; it will take time and investment, but it will happen, with or without legislation to force it. The result will be a simpler implementation process for enterprise-scale Big Data platforms that produce better accuracy with less upfront effort and lower implementation impact.