Recently, about a dozen of us at West Monroe spent an afternoon getting familiar with Kaggle, an online data science community centered around sponsored competitions. This was the first in a series of hackathons where we aim to expose ourselves to new tools and methods of data analysis.
The Kaggle competitions range from the introductory ‘Titanic: Machine Learning from Disaster’ to the 2017 Data Science Bowl, an effort to crowdsource interpretation of lung cancer imaging with a $1,000,000 cash prize. Other features include discussion forums covering all things data science, plus job boards for recruiters and job seekers alike. In total, Kaggle’s community includes over 500,000 data scientists across 200 countries.
If you’ve never used Kaggle, it’s very easy to get started. Just create an account and navigate to ‘Competitions’ to look at your options.
With so many options available to you, it’s easy to get lost trying to decide where to start. This leads us to our first tip:
1. Work with a goal in mind. The best way to get the most out of a Kaggle competition is to choose a specific tool or area in which you want to build your skills, and stay focused on that. It’s easy to become distracted by all of the different tools, code bases, and competitions out there, but you will learn more if you home in on one particular goal or task. Since I’m currently learning Python, I decided to use Python for all of my analysis.
2. Think statistics. Data science combines statistics with computer science to produce programs and algorithms that can ingest data and return meaningful insights. In order to effectively analyze data, you need statistical knowledge as well as programming skills. Before we started on our Kaggle competition, we reviewed some basic statistics, which provided a base for the analyses we would perform using code.
“At its core, data science involves using automated methods to analyze massive amounts of data and to extract knowledge from them.” – NYU Data Science
Here is a quick recap:
Linear regression. One of the most basic analysis tools, linear regression is simple but often extremely useful. It fits a straight line to a set of data points by minimizing the sum of the squared distances from each point to the line. The same idea extends to multiple predictor variables (multiple regression). It is useful for prediction and extrapolation when working with correlated variables.
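To make the least-squares idea concrete, here is a minimal Python sketch using NumPy; the hours-studied and exam-score numbers are invented purely for illustration:

```python
import numpy as np

# Toy data: hours studied vs. exam score (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

# Fit y = slope * x + intercept by minimizing the sum of squared residuals
slope, intercept = np.polyfit(x, y, deg=1)

# Extrapolate: predict the score for 6 hours of study
predicted = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The single call to `np.polyfit` with `deg=1` is exactly the line of best fit described above; higher degrees would fit polynomials instead.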
Classification. Classification involves separating data points into discrete groups and using those groups to infer further information. For example, you may want to teach an email application to tell spam from real mail.
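As a sketch of the spam example, here is a tiny nearest-centroid classifier in plain Python. The keyword list, features, and training messages are all made up for illustration; a real spam filter would learn from thousands of labeled emails.

```python
# Two hand-crafted features per message: count of "spammy" words
# and count of exclamation marks. All data below is invented.
SPAM_WORDS = {"free", "winner", "prize", "click"}

def features(message):
    words = message.lower().replace("!", " ! ").split()
    return (sum(w in SPAM_WORDS for w in words), message.count("!"))

# Labeled training examples
training = [
    ("Free prize! Click now!", "spam"),
    ("You are a winner! Claim your free gift!", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("Can you review my draft?", "ham"),
]

# "Train": compute the mean feature vector (centroid) for each class
centroids = {}
for label in ("spam", "ham"):
    vecs = [features(m) for m, lab in training if lab == label]
    centroids[label] = tuple(sum(c) / len(vecs) for c in zip(*vecs))

def classify(message):
    # Assign the label whose centroid is closest in feature space
    fx, fy = features(message)
    return min(centroids, key=lambda lab: (fx - centroids[lab][0]) ** 2
                                          + (fy - centroids[lab][1]) ** 2)

print(classify("Click to claim your free prize!"))  # spam
print(classify("Lunch tomorrow?"))                  # ham
```

The point is the shape of the workflow: extract features from labeled examples, summarize each group, then assign new points to the nearest group.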
Anomaly Detection. Anomaly detection is the identification of points that lie outside the normal range of a dataset. Also called outliers, these points can be helpful when trying to pinpoint things like bank fraud or defects. Anomaly detection can also be helpful when cleaning up datasets; sometimes outliers are the result of errors in data collection.
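A common first-pass approach is to flag any point more than a couple of standard deviations from the mean. Here is a minimal sketch using Python's standard library; the transaction amounts (and the suspicious 5000) are invented:

```python
import statistics

# Daily transaction amounts; the 5000 is a made-up anomaly
amounts = [42, 38, 51, 45, 39, 47, 5000, 44, 40, 48]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag points more than 2 standard deviations from the mean
outliers = [a for a in amounts if abs(a - mean) > 2 * stdev]
print(outliers)  # [5000]
```

Note that extreme outliers inflate the mean and standard deviation themselves, so in practice robust variants (median and interquartile range, for instance) are often preferred.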
3. Take advantage of Kernels. One of Kaggle’s coolest features is the access to other users’ shared code bases. They are under the ‘Kernels’ tab. Note that you can filter them by tool and sort in order of date, number of votes, and ‘hotness’. To get a running start, you can click into a kernel and click ‘Fork Notebook’ or ‘Fork Script’ to create a working copy for yourself.
Once you have your copy, you can work to improve that code rather than starting from scratch.
4. Consider your toolset. There is a huge variety of data science tools available on the Web, many of them 100% free.
Python, a general purpose programming language, is the most flexible and widely applicable of the tools mentioned here. If you want to learn to code, Python is a great choice because of its ease of use. Python also has a good selection of libraries that are tailored toward data analysis; a few of the most popular are Pandas, NumPy, and SciPy.
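For a taste of those libraries, here is a minimal Pandas sketch. The tiny table is invented, loosely echoing the kind of survival data you work with in the Titanic competition:

```python
import pandas as pd

# Tiny made-up table: passenger class and whether each passenger survived
df = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 1],
})

# Survival rate by passenger class in a single groupby
rates = df.groupby("pclass")["survived"].mean()
print(rates)
```

One expressive line replaces the loop-and-tally code you would otherwise write by hand, which is much of Pandas' appeal for this kind of analysis.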
R is a programming language created specifically for statistical computing. It is one of the dominant languages in data analytics and is widely supported by data science tools. R makes it very easy to implement statistical models and visualizations; it’s harder to use if you want to do things outside the realm of statistical analysis.
Azure Machine Learning, released in 2015, is part of Microsoft’s collection of cloud services. With Azure ML, you can use drag and drop tools to create predictive models. Since this is a relatively new tool, I’ve included an example of the interface below.
Power BI is a fast-growing player in the data visualization sector. It is low cost and easy to use, and new features are released monthly. Take a look at the modern-looking visualizations you can create below.
In conclusion, if you are an aspiring data scientist, or are just interested in analytics, Kaggle is a great way to get started with some guidance from the community. Personally, I was able to use the Titanic competition to apply the Python I’ve been learning to real analysis tasks. I was even able to reuse one of the analyses I ran to answer some questions about a dataset at my current client.
Looking forward to our next hackathon: computer vision for handwriting analysis.