Why you shouldn’t use zip codes for your hyperlocal & last-mile analysis

Original article was published on Artificial Intelligence on Medium

Zip codes and area boundaries ignore the nuances of the properties inside them! Exploring the Modifiable Areal Unit Problem in geospatial data science.

Co-author: Rishabh Jain

Striking differences in Market Potential of areas in NYC when calculated by different area definitions.

Introduction

At Locale.ai, we work with a number of last-mile, hyperlocal and mobility companies. For most of these companies, geospatial analysis is critical, and they typically have internal dashboards built on a BI platform or in-house using open source tools.

One of the reasons they use our product is to highlight what areas to focus on and how to contextualize their strategies to those areas. A caveat here is that these areas are not the traditionally defined areas of a city, such as zip codes. In this post, we will dive into why traditional area definitions are a poor basis for geospatial analysis.

Maps don’t always tell the truth!

Going by intuition, companies use either zip-code boundaries or hand-drawn neighbourhood boundaries to map their most critical metrics, such as market potential, utilisation and average customer value. These maps feed important decisions like which areas to expand into, run promotions in, or provision more supply for.

Now, if you’re a business with little or no intra-city operations, the difference might not be significant to you. But if you provide services in the last mile or at a hyperlocal level, the differences in insights have a significant impact on the decisions that the business and city teams make.

Which brings me to the next question: Why is this difference significant?

What often happens is that zip codes or arbitrarily defined neighbourhood boundaries dissolve the nuances that distinguish the different areas inside them.

In other words, a zip code that we treat as one big cell for analysis consists of many smaller cells without any similarity in demographics or economic potential.

Source: Wikipedia

To illustrate: if you live in Palo Alto, you are part of the world’s foremost innovation hub and are paying median home prices of $1.18 million. However, right across the train tracks, 18% of East Palo Alto residents live below the poverty line, and the average yearly income per person is $18,385.

The characteristic behaviour of these two areas would be very different. Those differences are present in the data you collect, but they often aren’t easy to unearth.

In the next section, we would like to show you how easily maps can lie and why different members of your team might recommend completely different areas of a city to focus on, based on the same set of metrics!

NYC City Manager’s Woes

Suppose you are Uber’s city manager and want to run contextual promotions and discounts in areas where you have high market potential or market share. For this exercise, we used the Uber Cab dataset available here and the NYC Yellow Taxi dataset available here. Now, we can simply define market potential as

For our area definitions, we choose the NYC housing areas and NYC zip-code areas. Using these two area definitions, we calculate the market potential and here’s the plot:

Left and right show MP by zip-code and housing areas respectively

Greener means more market potential for Uber, and pinker means it is already doing well in that area. When we filter for areas with MP greater than 70% to identify the top areas to run promotions in, we see the following areas:

Areas with Market Potential greater than 70%

The areas we got are quite different!

Modifiable Areal Unit Problem

Hence, it is safe to conclude that the decision changes completely depending on which set of boundaries we use for our analysis. This is a classical problem in geospatial data science known as the Modifiable Areal Unit Problem (MAUP).

Because of the MAUP, our decision becomes reliant on the shape and size of the area instead of the actual characteristics of the users within it.
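To see the effect in miniature, here is a toy sketch (not the NYC experiment). It assumes, purely for illustration, that market potential is the competitor's share of all trips in a zone; the article's own definition may differ. The trip counts, the zone names and the helpers `make_trips`, `market_potential` and `hot_zones` are all made up for this example.

```python
from collections import defaultdict


def make_trips(x, y, n_total, n_competitor):
    """Hypothetical trips at one location: (x, y, operator) tuples."""
    return ([(x, y, "competitor")] * n_competitor
            + [(x, y, "ours")] * (n_total - n_competitor))


# Four locations on an abstract 2x2 plane, 10 trips each.
TRIPS = (
    make_trips(0.5, 0.5, 10, 9)    # south-west: competitor dominates
    + make_trips(1.5, 0.5, 10, 8)  # south-east: competitor dominates
    + make_trips(0.5, 1.5, 10, 2)  # north-west: we dominate
    + make_trips(1.5, 1.5, 10, 1)  # north-east: we dominate
)


def market_potential(trips, zone_of):
    """ASSUMED definition: competitor trips / all trips, per zone."""
    competitor = defaultdict(int)
    total = defaultdict(int)
    for x, y, operator in trips:
        zone = zone_of(x, y)
        total[zone] += 1
        if operator == "competitor":
            competitor[zone] += 1
    return {zone: competitor[zone] / total[zone] for zone in total}


def hot_zones(trips, zone_of, threshold=0.7):
    """Zones whose market potential exceeds the threshold."""
    mp = market_potential(trips, zone_of)
    return sorted(zone for zone, value in mp.items() if value > threshold)


# Two equally "reasonable" ways to cut the same plane into two zones.
def by_column(x, y):
    return "west" if x < 1 else "east"


def by_row(x, y):
    return "south" if y < 1 else "north"


print(hot_zones(TRIPS, by_column))  # []
print(hot_zones(TRIPS, by_row))     # ['south']
```

With column zones, each zone averages a strong and a weak location (55% and 45%), so nothing clears the 70% bar. With row zones, the same trips put the south at 85%. The data never changed; only the boundaries did.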

An ideal unit of analysis needs the right shape and the right size. Let’s start with the shape. We could divide the geographic plane into squares of the same size, or maybe triangles, pentagons, or something else! We at Locale use hexagons instead of any other shape and you can get a glimpse of why here:

Let’s use hexes to plot the MP for NYC. Uber has a well-suited library, H3, that converts latitude/longitude pairs to hexagons of a given size. You can find out more about it here.

From the visualization, we can see the distribution of market potential across the city in a much more uniform way. We can filter these hexagons by market potential to get the best areas to run promotions in.

Market Potential by Hexagonal areas
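The binning step can be sketched without any geospatial library. The snippet below is a simplified stand-in for a hexagonal indexing library, not the code used in the experiment: it assumes the points have already been projected to a flat plane (for example, metres east and north of a reference point), and the helper names `hex_cell` and `bin_points` are hypothetical. It uses pointy-top hexagons addressed by axial coordinates (q, r), with `size` being the distance from a hexagon's centre to a corner.

```python
import math
from collections import Counter


def _round_axial(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r  # third cube coordinate; q + r + s is always 0
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error so the
    # cube constraint q + r + s == 0 still holds after rounding.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def hex_cell(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _round_axial(q, r)


def bin_points(points, size):
    """Count how many (x, y) points fall into each hexagon."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Hypothetical planar pickup locations.
pickups = [(0.0, 0.0), (0.1, 0.1), (math.sqrt(3), 0.0)]
counts = bin_points(pickups, size=1.0)
print(counts[(0, 0)], counts[(1, 0)])  # 2 1
```

Once every trip carries a hexagon index, a per-hexagon metric is a plain group-by on that index. A real deployment would rely on a library that also handles the Earth's curvature and multiple resolutions, which this planar sketch ignores.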

Another way to reduce bias further is to not bind locations to predefined areas at all, which leads to our next section.

Enter, Geo-spatial Clustering!

The idea is to let data decide what the significant areas are on which MP should ideally be computed. A simple density-based clustering algorithm like DBSCAN could be a good place to start! We have written about the other types of clustering in case you want to check that out:

We cluster all locations based on their proximity to each other, which surfaces the dense clusters. Then, we compute the convex hull of each cluster to get area boundaries. Now we have areas generated from the data itself, which lets us compute MP with less human bias. Again, we filter for areas where MP is more than 70% and get these places:

Hotspots detected by our algorithm with highest Market Potentials.
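The cluster-then-hull pipeline described above can be sketched in plain Python. This is a minimal illustration, not the implementation behind the experiment: it assumes planar coordinates, uses a brute-force O(n²) neighbour search, and the `eps` and `min_samples` values and the helper names (`dbscan`, `convex_hull`, `cluster_hulls`) are chosen only for the example. A neighbourhood here includes the point itself.

```python
import math
from collections import defaultdict


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, never expands
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # core point: keep growing
                queue.extend(reachable)
    return labels


def convex_hull(points):
    """Andrew's monotone chain; returns the hull counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def cluster_hulls(points, eps, min_samples):
    """Map each cluster id to the convex hull of its member points."""
    members = defaultdict(list)
    for point, label in zip(points, dbscan(points, eps, min_samples)):
        if label != -1:
            members[label].append(point)
    return {label: convex_hull(pts) for label, pts in members.items()}


# Hypothetical data: two dense blobs and one stray point.
blob = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
points = blob + [(x + 10, y + 10) for x, y in blob] + [(5.0, 5.0)]
hulls = cluster_hulls(points, eps=1.1, min_samples=3)
print(hulls[0])  # [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

The stray point at (5.0, 5.0) is labelled noise and gets no area, and each hull drops the interior point at the blob's centre. On real trip data you would likely reach for a library implementation such as scikit-learn's DBSCAN with a spatial index, then compute the metric per hull exactly as for any other polygon.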

It’s worth noting that a simple density-based approach can work here because of the nature and definition of the problem.

Since we consider only locations, density-based spatial clustering extracts the underlying areas that are more closely knit in terms of user behaviour.

For this decision to be reliable enough for a business use case, we use not just one metric but a set of metrics that largely shape user behaviour in an area. For example, an area full of office-goers and an area full of university students might show completely different travel patterns.

Now, as highlighted here, using the automated learning techniques for similarity and clustering analyses that we have built at Locale.ai, our algorithms work hard to find what areas behave similarly based on the metrics that you care about (growth potential, unit economics, power users) and show you what areas to focus on for each unit of your business.

This enables you to save or export these area profiles, reuse them for different kinds of analysis, and track them to maximize your revenue, demand and profitability in each of those areas, rather than relying on arbitrarily defined boundaries.

The complete experiment along with the code is available here. Feel free to experiment with some other dataset!