Photo By Nick Tierney
Last week I needed to understand how to calculate how much two spatial areas overlapped. I found a nice thread on GIS stack exchange that explained an answer (from user Sandy AB). This pretty much gave me what I was looking for - but I was after something with more pictures, so here’s an example of that.
The Data
I started by getting some spatial areas from the ABS at this link, selecting, “statistical local areas level 4 2021”:
I then did a bit of processing to shrink the data down to just my hometown, Brisbane.
Let’s read this in with sf
:
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
brisbane_sf <- read_sf(here::here("content/post/2021-08-18-how-do-i-find-out-how-much-a-shapefile-overlaps-another/data/brisbane-sla4.shp"))
The Data Vis
Let’s plot my hometown, Brisbane with ggplot2
and geom_sf()
.
Let’s say we have a simpler shape file that we want to compare to this - which we create by simplifying the shapefile with rmapshaper
.
library(rmapshaper)
#> Registered S3 method overwritten by 'geojsonlint':
#> method from
#> print.location dplyr
brisbane_simpler_sf <- brisbane_sf %>% ms_simplify(keep = 0.01)
With this simpler file, we want to know how much area we are losing. Here’s the “simpler” Brisbane in red.
We can overlay them to see how similar they are, full Brisbane in green, and new Brisbane in red:
ggplot() +
geom_sf(data = brisbane_sf,
fill = "forestgreen", alpha = 0.5) +
geom_sf(data = brisbane_simpler_sf,
fill = "firebrick",
alpha = 0.5)
So how do we calculate the difference between the two? We can use st_intersection
to find where both shapes overlap. Here it is visualised:
st_intersection(brisbane_sf, brisbane_simpler_sf) %>%
ggplot() +
geom_sf(fill = "brown", alpha = 0.5)
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
Compare this again to the previous plot - these are the brown sections.
But say we want to calculate the ratio of how similar these two shape files are - how do we do that? Here are the steps:
- Find the area of the shapefile for the original Brisbane file
- Find the area of the overlap between the two files
- Calculate the ratio of the area in the original shapefile, and the area of the overlapping area
- Calculate the area of the overlap compared to the original Brisbane file, giving us the percentage of overlap!
The area of the shapefile for the original Brisbane file
Here are the steps:
- Calculate area with
st_area
- Reduce the size of the data - then only keep the relevant data (just keeping the SA4 name (
SA4_NAME21
), and then area, then drop the geometry column).
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
brisbane_sf_areas <- brisbane_sf %>%
mutate(brisbane_original_area = st_area(.)) %>%
select(SA4_NAME21, brisbane_original_area) %>%
st_drop_geometry()
brisbane_sf_areas
#> # A tibble: 1 × 2
#> SA4_NAME21 brisbane_original_area
#> * <chr> [m^2]
#> 1 Brisbane Inner City 81738079
The area of the overlap between the two files
- Calculate the intersection of these two shape files (
st_intersection
) - Calculate that area (
st_area
) - Then only keep the relevant data again
intersection_area <- st_intersection(brisbane_sf, brisbane_simpler_sf) %>%
mutate(intersect_area = st_area(.)) %>%
select(SA4_NAME21, intersect_area) %>%
st_drop_geometry()
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
intersection_area
#> # A tibble: 1 × 2
#> SA4_NAME21 intersect_area
#> * <chr> [m^2]
#> 1 Brisbane Inner City 76638525
The ratio of the area in the original shapefile, and the area of the overlapping areas
Now we have our pieces, now let us add these columns of the other data back to it:
intersection_area %>%
left_join(brisbane_sf_areas,
by = "SA4_NAME21")
#> # A tibble: 1 × 3
#> SA4_NAME21 intersect_area brisbane_original_area
#> <chr> [m^2] [m^2]
#> 1 Brisbane Inner City 76638525 81738079
And then calculate the ratio - which will tell us how much the simpler shape file overlaps the original ratio.
intersection_area %>%
left_join(brisbane_sf_areas,
by = "SA4_NAME21") %>%
mutate(weight = intersect_area / brisbane_original_area)
#> # A tibble: 1 × 4
#> SA4_NAME21 intersect_area brisbane_original_area weight
#> <chr> [m^2] [m^2] [1]
#> 1 Brisbane Inner City 76638525 81738079 0.937611
And, because I think it’s good practice, all together as a function:
calculate_spatial_overlap <- function(shape_new,
shape_old) {
intersection_area <- st_intersection(shape_new, shape_old) %>%
mutate(intersect_area = st_area(.)) %>%
select(SA4_NAME21, intersect_area) %>%
st_drop_geometry()
# Create a fresh area variable for counties
shape_old_areas <- shape_old %>%
mutate(original_area = st_area(.)) %>%
select(original_area, SA4_NAME21) %>%
st_drop_geometry()
intersection_area %>%
left_join(shape_old_areas,
by = "SA4_NAME21") %>%
mutate(weight = intersect_area / original_area)
}
calculate_spatial_overlap(
brisbane_simpler_sf, brisbane_sf
)
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
#> # A tibble: 1 × 4
#> SA4_NAME21 intersect_area original_area weight
#> <chr> [m^2] [m^2] [1]
#> 1 Brisbane Inner City 76638525 81738079 0.937611
So this tells us:
The new shapefile covers about 93% of the old shape file.
That is, the red areas cover 93% of the green area:
Why do this?
Our use case in this example was to calculate the difference between shapefiles so we could then use this overalapping difference as a weight in subsequent measurements.
But you can do more with this - the example from the GIS stack exchange thread was trying to calculate the amount of overlap of a lot of smaller shape files on another shapefile - a measurement often referred to as “coverage”.
And that’s it - hope that helps someone!
Thanks
Thank you again to user Sandy AB from Stack Exchange, who posted the answer!