When I was growing up, my dad used to tell me, “Work on something you love and it will never feel like work.” In our Advanced Analytics practice, our team shares a common love of data and complex problems. We work on exciting challenges, pulling together our experiences to determine the right tool for the job and deliver uncommon results. One recent challenge fit those criteria, incorporating America’s most popular pastime – football. The best and brightest on our Advanced Analytics team collaborated to extract, enrich, and visualize NFL Twitter data using cutting-edge technologies. This project exemplifies a use case for big data technologies, natural language processing, and data provisioning for visualization. In the coming weeks, the team will release a series of blog posts detailing the implementation of our solution and the rationale behind our design decisions.
First things first, the end result is live (as of 2/1/2015). Click here to view the site!
Big Data Technologies:
Since Google published its papers on the Google File System and BigTable, Big Data has been hyped for advantages like commodity hardware, distributed processing, and schema-less data import. The commonly cited V’s capture where Big Data delivers value: Volume, Velocity, Variety, Veracity, and Value. Looked at more soberly, Big Data technologies are new tools with the potential to solve certain problems faster, cheaper, or more easily than more established solutions. On this project, WMP used Hadoop for three primary reasons:
- We had little understanding of, or control over, the schema of the incoming data, so we needed a flexible, schema-less storage system – one of the core benefits of Big Data technologies. We chose Hadoop over popular document-store databases like MongoDB because of its robust and relatively easy-to-use language processing integration.
- The unknown volume of data, and the potential for event-driven spikes in tweets, drove the need for scale-out capability. A distributed configuration let us run on inexpensive hardware and scale the application flexibly.
- A distributed environment let us spread more complex computation, such as sentiment analysis, across the cluster, and it kept our options open for distributed real-time analysis.
By using Hadoop we were able to deliver a scalable and affordable solution while accelerating time to market for data exploration, just in time for pre-season Ray Rice tweets.
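To make the distributed-processing idea concrete, here is a minimal sketch of the map/reduce pattern Hadoop parallelizes, written in plain Python with hypothetical sample tweets (the real pipeline ran on a cluster; the tweets and the hashtag-counting task here are illustrative only):

```python
from functools import reduce
from itertools import chain

# Hypothetical sample tweets; real input would come from the Twitter feed.
tweets = [
    "Huge win for the #Bears tonight! #NFL",
    "Can't believe that call... #NFL #refs",
    "#Bears defense looking sharp this preseason",
]

def map_hashtags(tweet):
    """Map step: emit a (hashtag, 1) pair for each hashtag in a tweet."""
    return [(word.lower(), 1) for word in tweet.split() if word.startswith("#")]

def reduce_counts(counts, pair):
    """Reduce step: sum the emitted counts by hashtag."""
    tag, n = pair
    counts[tag] = counts.get(tag, 0) + n
    return counts

# On a cluster, the map step runs in parallel across nodes and the framework
# shuffles pairs to reducers; locally we just chain and fold.
mapped = chain.from_iterable(map_hashtags(t) for t in tweets)
hashtag_counts = reduce(reduce_counts, mapped, {})
```

Because both steps operate on independent records, the same logic scales out across machines without modification – which is exactly the property that made Hadoop attractive for spiky, unpredictable tweet volumes.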
Natural Language Processing:
Not to be outdone by the Big Data hype, Data Science is climbing toward the top of 2015’s buzzwords for data analytics. Our Advanced Analytics team includes skilled subject matter experts who deliver computation-driven solutions such as predictive analytics, language processing, and statistical analysis. One key challenge in Data Science is finding the right tool for the job. Raw tweet text says little about the tweeter’s opinion or the rationale behind the tweet, and it is difficult to impose structure on the topics being discussed. It is valuable to understand whether a fan is staying supportive after a loss or giving up hope on their team after another bad season. We chose Python and R, running in Spark and MapReduce, for distributed natural language processing to determine the sentiment of each tweet. As you can see on the dashboard, this enrichment enables a dimension of reporting on the front end that would not be possible with raw data alone and helps tell a better story about identifiable trends in volume.
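As a toy illustration of sentiment scoring – not the model actually used on this project – a simple lexicon-based scorer looks like the following. The word lists are made up for the example; a production system would use a curated lexicon or a trained classifier. In a Spark job, a per-tweet function like this is what gets mapped over the distributed collection of tweets (e.g. `tweets_rdd.map(score_sentiment)`):

```python
# Hypothetical word lists for illustration only.
POSITIVE = {"win", "great", "love", "sharp", "hope"}
NEGATIVE = {"loss", "bad", "terrible", "awful", "ugly"}

def score_sentiment(tweet):
    """Return a crude polarity score: positive-word count minus negative-word count."""
    words = [w.strip(".,!?#").lower() for w in tweet.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(score):
    """Bucket a raw score into the sentiment dimension used for reporting."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The per-tweet independence of the scoring function is what makes this workload embarrassingly parallel and a natural fit for Spark and MapReduce.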
Provisioning & Visualization:
Once the schema had been explored and key reports identified, the team applied relational best practices to provision the data for the reporting layer. We implemented this as a SQL Server star schema on Azure, supplemented with an in-memory SSAS tabular cube. An API layer between the cube and the reporting front end standardizes report development and further secures the database layer. We chose Azure to simplify and expedite hardware setup and to allow easy hardware configuration changes.
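To sketch what star-schema provisioning means here, the snippet below splits enriched tweets into dimension tables (with surrogate keys) and a fact table that references them. The record shape and table names are hypothetical – the actual SQL Server schema may differ – but the keying pattern is the standard one:

```python
# Hypothetical shape of an enriched tweet after sentiment scoring.
enriched_tweets = [
    {"tweet_id": 1, "team": "Bears", "date": "2014-09-07", "sentiment": "positive"},
    {"tweet_id": 2, "team": "Bears", "date": "2014-09-07", "sentiment": "negative"},
    {"tweet_id": 3, "team": "Packers", "date": "2014-09-07", "sentiment": "positive"},
]

# Dimension tables: assign a surrogate key to each distinct attribute value.
team_dim = {t: i for i, t in enumerate(sorted({r["team"] for r in enriched_tweets}), 1)}
sentiment_dim = {s: i for i, s in enumerate(sorted({r["sentiment"] for r in enriched_tweets}), 1)}

# Fact table: one row per tweet, referencing the dimensions by key.
fact_tweets = [
    {
        "tweet_id": r["tweet_id"],
        "team_key": team_dim[r["team"]],
        "sentiment_key": sentiment_dim[r["sentiment"]],
        "date_key": r["date"].replace("-", ""),  # smart key, e.g. 20140907
    }
    for r in enriched_tweets
]
```

Shaping the data this way is what lets the tabular cube and the dashboard slice tweet volume and sentiment by team and date efficiently.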
It is important to note that SSAS tabular cubes are commonly supported by leading visualization products such as Tableau and Reporting Services. For this project, the dashboard could have been built with a best-in-breed visualization tool like those mentioned above, but we chose the open-source charting library D3 for its granular control and flexibility in depicting data. D3 is lightweight and mobile-capable out of the box, and flexible enough to accommodate unknown future customization needs.
Planned Future Content:
- Big Data & Football: Elaboration on the tools used for this proof of concept including best practices and lessons learned along the way.
- Big Data & the Middle Market: Additional use cases for Big Data technologies in the middle market and considerations for choosing the best tool for the job.
- Sentiment & Football: Elaboration on best practices for sentiment analysis using Python as it relates to this project. Many good vendors can deliver quality sentiment analytics, but pairing the right tools with the right questions about the data unlocks a whole new level of insight and flexibility.
- Data Procurement & Football: Elaboration on applying schema to raw message data and structuring data for performant reporting.
We welcome your feedback to help guide future hackathons and projects. Data and football, how cool is that? #WMPChallengeAccepted
Project Contributors: Matt Kosovec, Vadim Orlov, Andrew Platkin, Jeremy Wortz
Special Thanks to the WMP Technology Practice, Letteer Lewis, Ted Nubel, Jasmine Jones, and Marlee Maclean