Using statistical and predictive techniques on datasets can generate key insights that businesses may otherwise never realize. The power of these techniques is a hallmark of modern analytics, and many large companies have been investing in them. Still, with all of this energy, talent, and capital flowing into predictive analytics, there have been growing pains. Companies new to the space may commit critical errors on the way toward building accurate and actionable models. To create better business value out of predictive modeling, it’s important to anticipate and prevent mistakes before they materialize, or at least recognize and eliminate them early.
In this post, I will discuss three errors practitioners and purchasers of predictive analytics projects may face — ignoring ROI, overfitting models, and using bad data — and ways to anticipate and mitigate them in order to build more valuable models.
Problem: Not Thinking about the ROI
Purchasing a predictive project is a fun process, given the excitement around the technology and the plentiful success stories found online and in texts. However, sometimes the project doesn’t pay out as planned. Early on, managers can see stars instead of dollar signs and chase analyses that have a low chance of paying off, for a myriad of reasons. The marginal gains from the newly developed predictive model may never be realized because the planning stage never seriously considered the end goal – a justifiable and realistic return on investment.
Solution: Begin with the End in Mind
It is important to have a clear goal before starting any project, and predictive analytics is no exception. Why are we doing this project? What will this analysis solve for my company? How will this analysis help answer the project’s principal questions? These questions should be asked – and answered – throughout the project to ensure it will be a success. If the project seems to be moving too fast, take a step back and clarify goals and milestones with your stakeholders so both parties are aligned. Always keep the ROI in mind before, during, and after a predictive analytics project to effectively realize the returns you worked so hard to achieve.
Problem: Overfitting Models
Creating predictive models allows one to extrapolate an underlying trend in a dataset and subsequently predict future events. Much of the work involves using statistical techniques to parse large, complex datasets. When creating the model, it is easy to fit it so that it perfectly conforms to the current dataset, including the parts of the data that are irrelevant or generated mostly by chance. This is called overfitting, and it creates a model that is both inaccurate and deceptive. Overfitted models appear more complex while having less predictive power, which leads decision-makers to make poor choices based on misleading models.
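As a hypothetical illustration (the original post names no tools; NumPy and the synthetic linear dataset here are my assumptions), the sketch below fits a simple model and a needlessly complex one to the same noisy training data, then compares their error on held-out points. The high-degree polynomial chases the noise and generalizes worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: the true trend is linear, the rest is noise.
x = rng.uniform(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 30)

# Hold out part of the data to measure real predictive power.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def holdout_error(degree):
    """Fit a polynomial of the given degree and return held-out MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    preds = np.polyval(coeffs, x_test)
    return float(np.mean((preds - y_test) ** 2))

simple_err = holdout_error(1)    # matches the true linear trend
overfit_err = holdout_error(15)  # conforms to chance patterns in training data
```

On data like this, the degree-15 model fits the training set more tightly yet produces a larger held-out error than the degree-1 model – the signature of overfitting.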
Solution: Cross-Validation
Cross-validation is a technique that allows one to test whether a model is overfitted. It tests a model’s strength in different contexts to ensure that all the parameters are useful and have predictive power. There are many styles of cross-validation, and it is imperative that those running the analysis understand the positives and negatives of each method and utilize them as appropriate. Creating a powerful model and then skipping model testing is a faux pas that every data scientist should be trained to avoid. Stay aware of the problem of overfitting, and don’t hesitate to question whether a model has been properly tested before accepting it as sound and useful.
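A minimal sketch of k-fold cross-validation, one of the styles mentioned above (the NumPy implementation and synthetic data are my assumptions, not the post’s): each fold is held out in turn, the model is fit on the remainder, and the held-out errors are averaged. Comparing average scores across model complexities reveals which model actually generalizes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 40)

def kfold_mse(degree, k=5):
    """Average held-out MSE across k folds for a polynomial model."""
    indices = np.arange(len(x))
    folds = np.array_split(indices, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(indices, fold)        # everything but this fold
        coeffs = np.polyfit(x[train], y[train], degree)
        preds = np.polyval(coeffs, x[fold])        # predict the held-out fold
        errors.append(np.mean((preds - y[fold]) ** 2))
    return float(np.mean(errors))

# The simple model should score better than the overfitted one.
scores = {d: kfold_mse(d) for d in (1, 3, 12)}
```

Because every observation serves as test data exactly once, the averaged score is a more honest estimate of predictive power than a single train/test split – which is precisely what makes it a useful check against overfitting.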
Problem: Using Bad Data
There are many ways data can be wrong: user error, ideological bias, random noise, sensor malfunctions, accidental deletion, rounding errors, and more. Data scientists have to be extremely mindful of data quality, because models built on bad data will be inaccurate. Too often, bad data is not dealt with, or is simply ignored. Some may even assume that modern statistical techniques can detect and correct for any kind of messy data one may encounter. Both of these ideas are incorrect; it is important to deal with the data before your analysis becomes distorted and faulty. As with all of data science, domain-area knowledge is key, and it allows one to check for – and potentially fix – bad data.
Solution: Data Cleaning
There are many ways to clean data, but large and complicated datasets can make cleaning seem like a daunting task. However, there are tools that can help automate the process and examine the dataset early on. At West Monroe Partners, we leverage an internally developed data profiling engine to identify data anomalies and clean up the data prior to analysis. Following well-established data cleaning standards, the process begins with univariate analysis and then moves on to multivariate analysis. Univariate analysis looks at one variable at a time, examining statistics such as the mean, median, minimum/maximum, outliers, and missing values. Multivariate analysis follows, looking at relationships between variables through analysis and visualization and checking whether those relationships make sense in the context of the data. Cleaning data is more of an art than a science, and as with many art forms, it takes time and a trained eye. Before embarking on any analytics task, always check the data, constantly ask questions, and remain perpetually skeptical of every piece of data you are working with.
As Warren Buffett famously said, “Risk comes from not knowing what you’re doing.” When it comes to predictive analytics, the same concept applies. Learning to anticipate and avoid critical errors allows one to build better models and generate immense business value. The field of prediction is large and will continue to grow. Big data, machine learning, data discovery, and customer analytics all require statistical heavy lifting to create value out of the growing information we have access to. It’s up to businesses to understand how to better predict the future, integrate modeling techniques into their business processes, and make important, game-changing decisions in the marketplace.