Geospatial Data Engineering: Spatial Indexing

Optimizing queries, improving runtimes, and geospatial data science applications

Dea Bardhoshi
Towards Data Science

--

Photo by Tamas Tuzes-Katai on Unsplash

Intro: why is a spatial index useful?

In doing geospatial data science work, it is very important to think about optimizing the code you are writing. How can you make datasets with hundreds of millions of rows aggregate or join faster? This is where concepts such as spatial indices come in. In this post, I will talk about how a spatial index gets implemented, what its benefits and limitations are, and take a look at Uber’s open source H3 indexing library for some cool spatial data science applications. Let’s get started!

🗺 What’s a spatial index?

A regular index is the kind of thing you might find at the end of a book: a list of words and where they have shown up in the text. This helps you quickly look up any reference to a word you’re interested in within a certain text. Without this handy tool, you would need to manually look through every page of your book, searching for that one mention you wanted to read about.

In modern databases, this issue of querying and searching is also very pertinent. Indexing often makes looking up data faster than filtering, and you can create indices based on a column of interest. For geospatial data in particular, engineers often need to look at operations like “intersection” or “is nearby to”. How can we make a spatial index so that these operations are as fast as possible? First, let’s take a look at some of this geospatial data:

Two non-intersecting features (image by author)

Let’s say that we want to run a query to determine if these two shapes are intersecting. By construction, spatial databases create their index out of a bounding box that contains the geometry:

Making a large bounding box (image by author)

For answering whether these two features intersect, the database will compare whether the two bounding boxes have any area in common. As you can see, this can quickly lead to false positives. To fix this issue, spatial databases like PostGIS typically partition these large bounding boxes into smaller and smaller ones:

Going smaller: making child bounding boxes (image by author)

These partitions are stored in R-trees. R-trees are a hierarchical data structure: they keep track of the large “parent” bounding box, its children, its children’s children and so on. Every parent’s bounding box contains its children’s bounding boxes:

R-tree visualized (image by author)

The operation “intersect” is one of the key operations that benefits from this structure. While querying an interesection, the database looks down this tree asking “does the current bounding box intersect the feature of interest?”. If yes, it looks at that bounding box’s children and asks the same question. As it does so, it is able to quickly traverse the tree, skipping the branches that do not have an intersection and thus improve the query’s performance. In the end, it returns the intersecting geometry as desired.

🧰 In Practice: trying out a spatial index with GeoPandas

Let’s now take a concrete look at what using a regular row-wise procedure vs a spatial index looks like. I’ll be using 2 datasets representing NYC’s Census Tracts as well as City Facilities (both licensed through Open Data, and available here and here). First, let’s try out the “intersection” operation in GeoPandas on one of the Census Tract geometries. ‘Intersection’ in GeoPandas is a row-wise function checking each row of the column of interest against our geometry and looking at whether they intersect or not.

GeoPandas also offers a spatial index operation that uses R-trees and allows us to perform intersections as well. Here is a runtime comparison of the two methods over 100 runs of the intersection operation (note: because the default intersection function is slow, I only selected around 100 geometries from the original dataset):

💨 How much faster is a spatial index? (image by author)

As you can see, the spatial index approach offered much improved performance over the vanilla intersection method. In fact, here are the 95% confidence intervals for the runtimes of each:

Great! Then, why would we ever not want to use a spatial indexing? Are there cases when it offers no benefits? Well, yes. Some of these limitations are due to the way spatial indexing stores the leaves in the data. It turns out that the way the raw data is distributed affects how bounding boxes are placed into the R-tree. Specifically, if a large chunk of the data is concentrated in the same geographic space, they will tend to share the same parents and thus be grouped together in the same branches. This can lead to skewed trees that don’t offer much optimization when querying.

💻 What do other spatial indices look like?

Other companies have also adapted their own spatial indices. Uber uses H3, a hexagonal hierarchical indexing system that partitions the world into equal-area hexagons. Hexagons have many benefits when modeling people’s movement around a city or for problems like calculating a radius. Geospatial data is bucketed into these hexagons, which serve as the company’s main unit of analysis. The grid is constructed by overlaying 122 hexagonal cells onto an icosahedron-map projection, and it supports a wide range of functions including aggregation, joining and machine learning applications.

This system as well as a lot of its functionality is open-source and available on GitHub for analysis. One of the functions of the H3 API is converting latitude and longitude points into strings representing a unique Hexagon, according to the specified resolution. Let’s do this operation over the entire Facilities Database and also convert the hexagon strings into polygons:

One question that often comes up in these spatial data analysis projects is how many projects in the hexagons are categorized by some column, say “agency”. Luckily, this is very easy to calculate and visualize now that we have the data bucketed into H3 hexagons:

Which agencies have the most facilities? (image by author)

In this case, you can see the DCAS (Department of Citywide Administrative Services) and PARKS (Department of Parks and Recreation) are the two agencies with the most facilities per each hexagon. This likely makes sense as these two agencies would have more physical facilities (think admin buildings or recreational areas like parks).

Conclusion

Spatial indices, as you saw, are very useful optimization tools for geospatial data science and analysis. In the case of a simple intersection query, using a spatial index dramatically improved the performance of the query compared to the standard GeoPandas intersection function. There were many nuances to how this index is implemented, as well as its implications, like having huge branches of clustered data. As we saw, companies have developed their own solutions: one example is Uber’s H3 open-source index, which allows us to answer various spatial analysis questions. While I demonstrated an agency-based facilities count operation, H3 provides a baseline for other more complex machine learning applications.

If you like this kind of content but want to learn about urban planning tech more broadly, I also write a newsletter called “The Zoned Out Chronicles”. I encourage you to check it out!

Thanks for reading!

--

--

👩‍💻 Data Science UC Berkeley '23 | 🏙 Data Science, Urban Planning, Civic Technology | ✍️ Newsletter: https://deabardhoshi.substack.com/