The Power of Correlation Models
One of the main reasons we model data is to better understand the relationships between variables. With the advent of Big Data, analysts gained access to a pool of raw historical data unlike anything they had before. Modeling used to require a hypothesis up front, followed by data collection, but Big Data has paved the way for correlation models. These models don’t require a hypothesis; they simply capture observed correlations between variables, regardless of whether any causal relationship exists. Building correlation models can be a far more efficient way to capture, create, and analyze large data sets than building causation models.
A well-cited example of this principle is the 2009 debut of Google Flu Trends. A group of researchers at Google outpaced the Centers for Disease Control and Prevention (CDC) in tracking the spread of influenza across the country by monitoring Google search queries.
Google Flu Trends remained accurate for three years, but in 2012 the accuracy of its results dropped off sharply, with no obvious explanation.
This leads us to one problem with correlation models: when they break down, we have no immediate starting point for diagnostics. Without a deeper understanding of the model’s inputs, we’re forced into a sort of guess-and-check game.
When Google Flu Trends broke down, the researchers were left scratching their heads. Some believed an outpouring of flu-related news articles at the time caused spikes in related Google searches, subsequently throwing off the model’s results; others believed that a change in consumer travel habits led to different infection chains. Ultimately, there was no way to know definitively what went wrong, or which variables were to blame.
Google was able to update its algorithm, but it took significant time and effort. At a high level, the problem with the model was the data it was using. The researchers didn’t vet the raw queries to determine which were statistically significant; instead, they aggregated the entire data set and looked for keywords.
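One way to vet candidate queries, rather than aggregating everything, is to test each query's correlation against ground truth before admitting it to the model. The sketch below uses entirely synthetic data; the series names (`cdc_ili_rate`, `flu_symptoms`) and the significance threshold are illustrative assumptions, not Google's actual method.

```python
import math
import random

random.seed(1)

# Hypothetical weekly CDC illness rates (ground truth) -- synthetic data.
weeks = 52
cdc_ili_rate = [abs(math.sin(w / 8.0)) + random.gauss(0, 0.05) for w in range(weeks)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def is_significant(r, n, t_crit=2.01):
    """t-test on a correlation coefficient (df = n - 2, roughly p < 0.05)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return abs(t) > t_crit

# A genuinely related query tracks the ground truth plus noise;
# an unrelated query is pure noise.
flu_symptoms = [v + random.gauss(0, 0.1) for v in cdc_ili_rate]
unrelated = [random.gauss(0, 1) for _ in range(weeks)]

for name, series in [("flu symptoms", flu_symptoms), ("unrelated", unrelated)]:
    r = pearson_r(series, cdc_ili_rate)
    print(name, round(r, 2), "significant:", is_significant(r, weeks))
```

Only queries that pass a test like this would be allowed to drive the model; the rest are treated as noise.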
Statistical Error & Big Data
Classical statistics teaches about a common error known as the “multiple-comparisons problem”: test enough variables and some will appear significant by chance alone. Analysts working with large data sets tend to jump to statistical inferences without considering how many comparisons they are implicitly making. Consider the results of a hypothetical new drug’s clinical trial. A surface analysis of the results might suggest that the drug accomplished its goal, whatever that may be, but that conclusion doesn’t take the uniqueness of the participants into consideration. What if the drug only worked in the trial due to a set of specific conditions that the trial’s participants exhibited? Maybe these participants have specific genes that are required for the drug to take effect, and maybe these genes aren’t common outside of the United States.
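The multiple-comparisons problem is easy to demonstrate in a few lines. The simulation below, a minimal sketch with made-up parameters, measures many variables that have no real effect at all, yet a handful still clear the usual significance bar purely by chance:

```python
import random
import statistics

random.seed(0)

def two_sample_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

# 200 variables, none of which has a real effect: treatment and control
# are drawn from the same distribution every time.
n_vars, n_subjects = 200, 50
false_positives = 0
for _ in range(n_vars):
    treatment = [random.gauss(0, 1) for _ in range(n_subjects)]
    control = [random.gauss(0, 1) for _ in range(n_subjects)]
    # |t| > 1.98 corresponds roughly to p < 0.05 at these sample sizes
    if abs(two_sample_t(treatment, control)) > 1.98:
        false_positives += 1

print(f"'significant' results among {n_vars} null variables: {false_positives}")
```

On average about 5% of the null variables will test as “significant.” A correction such as a Bonferroni-adjusted threshold (dividing the significance level by the number of comparisons) is the classical remedy.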
Big Data exhibits the multiple-comparisons problem to an even higher degree, because the number of possible comparisons grows with the permutations in the data set. If we use Big Data to fuel a correlation model without considering statistical significance, our models can break down just as Google Flu Trends did.
It’s certainly possible to test the validity of a correlation model, but doing so requires considerable effort and transparency, and it still leaves the user without full confidence in the model. It’s hard to justify that additional effort, seeing as the lack of required effort is one of the main reasons you would build the model in the first place. And in a realistic scenario, businesses would hesitate to be transparent with their data sources to avoid any competitive disadvantages.
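One basic validity test is out-of-sample evaluation: fit the correlation on one time window, then check whether it still predicts well on held-out data. The sketch below uses fabricated data in which the underlying relationship drifts after the training window, mimicking the kind of silent breakdown Google Flu Trends experienced; the drifted coefficient and noise levels are arbitrary assumptions.

```python
import random

random.seed(2)

# Training window: a stable relationship y = 2x plus noise.
n_train, n_test = 100, 100
x_train = [random.gauss(0, 1) for _ in range(n_train)]
y_train = [2.0 * v + random.gauss(0, 0.5) for v in x_train]

# After "deployment" the world changes: the coefficient drifts to 0.5.
x_test = [random.gauss(0, 1) for _ in range(n_test)]
y_test = [0.5 * v + random.gauss(0, 0.5) for v in x_test]

# Fit a simple least-squares slope (no intercept) on the training window...
slope = sum(a * b for a, b in zip(x_train, y_train)) / sum(a * a for a in x_train)

def rmse(xs, ys, m):
    """Root-mean-square error of predictions m * x against y."""
    return (sum((m * a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)) ** 0.5

# ...then compare in-sample error against held-out error.
print("train RMSE:", round(rmse(x_train, y_train, slope), 2))
print("test RMSE:", round(rmse(x_test, y_test, slope), 2))
```

A large gap between training and held-out error is a warning sign that the correlation no longer holds, but note what the test does not tell you: why the relationship drifted.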
We need to approach Big Data sets with caution and see them for what they truly are: messy hodgepodges of raw data. Correlation analysis, like all forms of analysis, applies well in certain situations, but it should not be treated as a blanket solution. If you’re going to make an enterprise-scale investment, a causation model may be well worth the money. Downtime in any decision model can have significant repercussions, often measurable in dollar losses. Blindly trusting Big Data and correlation models can prove to be a damaging way to make decisions.