What’s more satisfying than finding a way to combine two of your passions? I asked myself this as I cranked out Python code to process online fish stocking records in a recent fishing analytics pet project. But let’s take a step back first. Our CEO Kevin McCarty was recently featured in a Forbes article discussing how tech nerds become successful commercial technologists. The article highlights the challenge of learning how to apply technology to generate business value. I’ve learned how to do this in my career thus far as a technology consultant serving the middle market. How would I apply this to my favorite hobby – fishing?
With summer fishing season now in full swing across the Midwest, I dusted off a dataset I found over a year ago. Back when I was working out of our Minneapolis office last year, I identified a dataset with historical fish stocking records from the Minnesota Department of Natural Resources (DNR). I had my fishing analytics use case:
Gain visibility into Midwestern fish stocking to target well-stocked lakes and increase fishing odds
Across the states of the northern U.S. Midwest, the availability of online aggregate fish stocking records turned out to be very strong (with the exception of Illinois – my home state, of course). As I learned, the key challenge in quickly generating insights, i.e., “business value,” from these datasets was threefold. First, the data structure of one of the sources was not directly consumable for fishing analytics. Second, the same data source did not record the location of the fish stocking (i.e., the lake, river, or pond) in coordinates that could be visualized on a map. Third, a couple of the other data sources recorded the location in Public Land Survey System (PLSS) coordinates, a format most common data visualization tools cannot use for geospatial analytics.
My background is mostly in data engineering, primarily on the Microsoft BI stack. I quickly realized I needed to dust off my programming chops to overcome these challenges and meet the use case. Let’s start with the first challenge.
Key Challenge #1: HTML Web Scraping
That first fish stocking dataset that I found back in 2016 from the Minnesota DNR was, unfortunately, not downloadable; it was embedded in HTML elements in a list format. As a result, I developed an HTML web scraper in Python using the Beautiful Soup library to:
- Iterate through each available year’s web page of stocking records
- Ingest the web page’s raw HTML
- Parse the data elements based on the tagging schema
- Generate an aggregate CSV output ready to be processed for analytics
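The steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual program: the URL pattern, tag names, and CSS classes are hypothetical stand-ins, since the real MN DNR pages use their own schema.

```python
import csv
import urllib.request

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical URL pattern; the real MN DNR site uses its own scheme.
BASE_URL = "https://www.dnr.state.mn.us/stocking?year={year}"

def scrape_year(year):
    """Fetch one year's stocking page and parse its records."""
    with urllib.request.urlopen(BASE_URL.format(year=year)) as resp:
        html = resp.read().decode("utf-8")
    return parse_records(html, year)

def parse_records(html, year):
    """Parse stocking records out of the page's list elements
    (the tag/class names here are illustrative assumptions)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("li", class_="stocking-record"):
        records.append({
            "year": year,
            "lake": item.find("span", class_="lake").get_text(strip=True),
            "species": item.find("span", class_="species").get_text(strip=True),
            # Counts often come formatted with thousands separators.
            "count": int(item.find("span", class_="count")
                             .get_text(strip=True).replace(",", "")),
        })
    return records

def write_csv(records, path):
    """Generate the aggregate CSV output ready for analytics."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["year", "lake", "species", "count"])
        writer.writeheader()
        writer.writerows(records)
```

Looping `scrape_year` over the available years and feeding the combined list to `write_csv` yields the single aggregate file.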
Key Challenge #2: GIS Data Acquisition
Another issue with the Minnesota dataset was that it did not contain geospatial coordinates. Luckily, I tracked down a Minnesota lakes shapefile and converted it to coordinates by following a colleague’s blog post on open-source Geographic Information Systems (GIS). I then tested how well the Minnesota fish stocking data matched on lake name against my new lake coordinates file. It worked well for a large portion of the stocking data; however, there was a long tail of lakes for which it did not, mostly due to duplicate lake names across counties. I guess that’s to be expected when working with data from the “Land of 10,000 Lakes” (over 11,000 in reality). I found a public web API from the FCC that returns the county and state for a given set of coordinates, so I developed another Python program using the Requests library to:
- Parse the mid-range latitude and longitude (which I processed outside of this Python code) for each Minnesota lake from the coordinates-extracted shapefile
- Send the coordinates to the public web API
- Receive a JSON response and parse out the county name
- Re-generate the lake coordinates file with the county name to include in matching to the stocking data
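That lookup can be sketched like this. The endpoint is the FCC’s public census block lookup API; treat the exact parameter names and JSON shape as assumptions worth verifying against the current FCC API docs.

```python
import requests  # pip install requests

# FCC census block lookup endpoint (verify against current FCC API docs).
FCC_URL = "https://geo.fcc.gov/api/census/block/find"

def lookup_county(lat, lon):
    """Return (county_name, state_code) for a coordinate pair."""
    resp = requests.get(
        FCC_URL,
        params={"latitude": lat, "longitude": lon, "format": "json"},
    )
    resp.raise_for_status()
    return parse_county(resp.json())

def parse_county(payload):
    """Parse county and state out of the assumed JSON response shape."""
    return payload["County"]["name"], payload["State"]["code"]
```

Calling `lookup_county` once per lake, then writing the results back alongside the coordinates, produces the enriched lake coordinates file.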
Adding the county to my matching algorithm improved the ability to combine these datasets. There were still some lakes in the stocking dataset missing coordinates. I decided to manually reconcile a few of the higher-stocked lakes, leave the rest unmatched, and accept having the majority of the data covered. Now the third hurdle – the PLSS coordinates nightmare.
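The county-aware matching can be illustrated with a minimal sketch. The field names here are hypothetical; it simply assumes both datasets have been loaded as lists of dicts.

```python
def build_lake_index(lake_coords):
    """Index lake coordinates by (lake name, county), lowercased
    so matching is case-insensitive."""
    return {(r["lake"].lower(), r["county"].lower()): (r["lat"], r["lon"])
            for r in lake_coords}

def match_stocking(stocking_records, lake_index):
    """Attach coordinates to stocking records; unmatched lakes get None."""
    matched = []
    for rec in stocking_records:
        key = (rec["lake"].lower(), rec["county"].lower())
        matched.append({**rec, "coords": lake_index.get(key)})
    return matched
```

Keying on (name, county) instead of name alone is what disambiguates the duplicate lake names; records that still come back with `coords` of None are the long tail left for manual reconciliation.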
Key Challenge #3: PLSS Coordinates Conversion
After dealing with the Minnesota HTML issue, I was happy to find that both the Wisconsin and Michigan datasets could be exported as CSV files from their respective DNRs. Better yet, these fish stocking datasets contained geospatial coordinates on each record! But wait – the coordinates were in Public Land Survey System (PLSS) format. Prior to this endeavor, I had minimal GIS experience and had never worked with PLSS coordinates before. I thought there had to be a mathematical equation I could use to simply convert these to latitude/longitude, and I’d be on my way. Wrong! PLSS is a rectangular, hierarchical plot system of townships, ranges, and sections within each state. After extensively searching for a solution, I finally found a free online converter with an open ArcGIS REST API. I developed yet another Python program using the Requests library to:
- Parse the PLSS coordinates from each Wisconsin and Michigan fish stocking record
- Send the PLSS coordinates to the web API
- Receive a JSON response and parse out the latitude and longitude coordinates
- Generate an aggregate CSV output with usable coordinates for geospatial analytics
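The conversion loop looked roughly like this. Since I’m not naming the converter here, the URL, parameter names, and response shape below are placeholders for whatever service you use, not the actual API.

```python
import requests  # pip install requests

# Placeholder endpoint; substitute the real converter's ArcGIS REST URL.
CONVERTER_URL = "https://example.com/arcgis/rest/services/plss/convert"

def plss_to_latlon(state, township, range_, section):
    """Convert one PLSS description to lat/lon via the web service
    (parameter names are placeholders for the real service's)."""
    params = {"state": state, "township": township,
              "range": range_, "section": section, "f": "json"}
    resp = requests.get(CONVERTER_URL, params=params)
    resp.raise_for_status()
    return parse_latlon(resp.json())

def parse_latlon(payload):
    """Parse a lat/lon pair from an assumed ArcGIS-style JSON response."""
    geom = payload["features"][0]["geometry"]
    # ArcGIS geometries put longitude in x and latitude in y.
    return geom["y"], geom["x"]
```

Running each Wisconsin and Michigan record through `plss_to_latlon` and writing the results out gives the aggregate CSV with map-ready coordinates.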
The Data Architecture
Acquisition & Processing
The data acquisition and processing solutions described above are summarized in the visual above. As shown, the North Dakota dataset (acquired from the U.S. open data catalog) did not require any additional processing before integrating with the other data sources. Kudos to the North Dakota DNR.
For the data integration layer, I initially attempted to use Microsoft Power BI’s ETL (Extract, Transform, Load) toolset as a lightweight option – after all, this was just a pet project. About halfway through building this layer, I swapped Power BI out for our own internal data integration tool, the Rapid Analytics Platform (RAP). I chose RAP for its extremely low ramp-up time and development effort, thanks to its configuration-driven, automated approach – advantages it offers over traditional ETL tools without sacrificing functionality or performance. Traditional ETL tools require the effort of creating custom pipelines, data layers, and file and table schemas (the list goes on). Once integrated, the data was ready to be presented in a visually appealing way.
Finally, for the visualization layer, I chose Microsoft Power BI for its easy development, accessibility, and custom visuals. I attempted to use the ArcGIS Maps for Power BI custom visual from the Power BI marketplace for its more advanced geospatial analytics capabilities. However, it turns out that this plug-in cannot be embedded in web pages – a deal breaker, since I wanted to embed the final dashboard in this blog post. As a result, I settled for Power BI’s standard map visual in order to have a functioning dashboard embedded in this post (see below for the live public dashboard!). Enough on the architecture – let’s get to the exciting part.
The Final Output
On any technology implementation, it can be easy to lose sight of the end in mind: the reason for doing it and the ultimate business value to achieve. So, as a reminder, the fishing analytics use case that I set out to solve was: gain visibility into Midwestern fish stocking to target well-stocked lakes and increase fishing odds. I kept this in mind when developing the dashboard visualizations in Power BI. The main dashboard I developed to solve this use case is the 3-Year Spatial Stocking dashboard. It shows aggregate fish stocking for the past three years across Midwestern water bodies. Custom slicers enable filtering to certain fish species for targeted analysis.
For example, I really love bass fishing. To help determine where my next fishing trip should be, I can select the bass group on the right, then narrow down to a specific bass species if I want to target just one. The map auto-filters and zooms to the water bodies with bass stocking in the past three years. I can quickly determine that North Lake in Walworth County in southern Wisconsin is the closest lake to me that appears to have the strongest bass stocking. Based on the custom tooltip calculations that I developed, I can also see that largemouth bass is the top stocked species in that lake, followed by northern pike. Here is a quick video showing this example analysis:
The other dashboard that I developed (State Stocking Trends) shows the full historical dataset. It shows year-to-year fish stocking trends for each state to get a better idea of whether stocking is trending up or down within each state for any variety of species. I could have gone crazy with building all kinds of dashboards, visuals, and views of the data. However, I wanted to keep the solution laser focused on the use case. This tends to be the most effective approach in providing business value in any technology implementation.
At West Monroe, we use our uncommon blend of deep technology expertise and business acumen to create valuable insights from enterprise data for our clients. Fishing analytics was certainly a new use case for this blend of skills that I have sharpened over my years at West Monroe. It was really fun combining two of my passions to create something cool. Happy fishing!