
Basic Spatial Operations on GeoDataFrames¶

We will review some important operations on GeoDataFrames (GDF). This is a basic toolkit for a social scientist, but the results depend a lot on the quality of the maps you have.

A spatial operation is a way of doing arithmetic with geometries! Our inputs, the maps (polygons, lines, or points), will be summed, differenced, filtered, dissected, and so on.

Keep in mind that these basic operations will be used later for practical applications in the coming weeks.

Getting ready¶

The links to our maps on GitHub are here:

In [ ]:
linkWorldMap="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/WORLD/worldMaps.gpkg"
linkBrazil="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/BRAZIL/brazil_5880.gpkg"
linkIndicators="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/WORLD/worldindicators.json"

Let me introduce another data source on seaports. Their locations in longitude/latitude are stored in the file seaports.csv, which we already have on GitHub:

In [ ]:
linkSeaPorts='https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/WORLD/seaports.csv'

Let's get some maps:

In [ ]:
import geopandas as gpd

#world 
world_rivers=gpd.read_file(linkWorldMap,layer='rivers')
#brazil 
brazil5880=gpd.read_file(linkBrazil,layer='country')
airports_brazil5880=gpd.read_file(linkBrazil,layer='airports')
states_brazil5880=gpd.read_file(linkBrazil,layer='states')
municipalities_brazil5880=gpd.read_file(linkBrazil,layer='municipalities')
#some indicators
indicators=gpd.read_file(linkIndicators)

# the seaports
import pandas as pd 
infoseaports=pd.read_csv(linkSeaPorts)
In [ ]:
# the seaports data has too many columns:
infoseaports.columns.to_list()

Let me keep some columns, and turn the DF into a GDF:

In [ ]:
#rename
infoseaports.rename(columns={'Main Port Name':'seaport_name','Country Code':'country_name'},inplace=True)
#keep a few columns
infoseaports=infoseaports.loc[:,['seaport_name', 'country_name','Latitude', 'Longitude']]
#spatial points (unprojected)
seaports=gpd.GeoDataFrame(data=infoseaports.copy(),
                           geometry=gpd.points_from_xy(infoseaports.Longitude,
                                                       infoseaports.Latitude), 
                          crs=4326)# notice it is unprojected

# keep Brazil
seaports_bra=seaports[seaports['country_name']=='Brazil'].copy()

# reset indexes
seaports_bra.reset_index(drop=True, inplace=True)

# reprojecting
seaports_brazil5880=seaports_bra.to_crs(5880) # projected crs

Now, let's see some important spatial operations!


UNARY Operations on Geo DataFrames¶

As the name implies, everything we do here requires ONE GDF as input.

I. Filtering¶

a. Using iloc and loc¶

The condition here depends on knowing index/column positions, or index/column labels.

In [ ]:
#if we have
states_brazil5880.head()
In [ ]:
# iloc for positions (same as in pandas DF)

states_brazil5880.iloc[:10,1:]
In [ ]:
# loc for labels (same as in pandas DF)
states_brazil5880.loc[:8,'state_code':]

Keep in mind that if you do not include the geometry column, you will get a DataFrame (DF) back, not a GeoDF.

In [ ]:
# GDF
type(states_brazil5880.loc[:8,'state_code':])
In [ ]:
# df
type(states_brazil5880.loc[:8,:'state_code'])

Also remember this detail:

In [ ]:
# you lose the spatial structure when keeping ONE row!
type(states_brazil5880.loc[8,:])
In [ ]:
# you keep the spatial structure if the row index is a list
type(states_brazil5880.loc[[8],:])

b. More Pandas Filtering for GeoPandas¶

In [ ]:
# complex conditions with query
airports_brazil5880.query('elevation_ft > 5000 and airport_type=="small_airport"')
In [ ]:
# filter based on subset
choices=['large_airport','seaplane_base']
airports_brazil5880[airports_brazil5880.airport_type.isin(choices)]
In [ ]:
# filter with text - startswith / endswith
textCondition=('Presi','Depu')
airports_brazil5880[airports_brazil5880.airport_name.str.startswith(textCondition)]
In [ ]:
# filter with text - contains (more flexible)
textPattern='Presidente|Deputado'
airports_brazil5880[airports_brazil5880.airport_name.str.contains(textPattern)]
# notice 'Refinaria Presidente Bernardes Heliport'
In [ ]:
# Filter rows where specific column is NOT null
airports_brazil5880[airports_brazil5880.elevation_ft.notna()]
In [ ]:
# Filter rows where specific column IS null
airports_brazil5880[airports_brazil5880.elevation_ft.isna()]

c. Slicing with cx¶

But as a GDF, you can also filter using coordinates via cx.

Let me get Brazil's centroid to show you how this works:

In [ ]:
brazil5880_cen=brazil5880.centroid
brazil5880_cen

Here, I recover each coordinate value:

In [ ]:
mid_x,mid_y=brazil5880_cen.x[0],brazil5880_cen.y[0]
mid_x,mid_y

Let me select airports north of the centroid:

In [ ]:
airports_brazil5880.cx[:,mid_y:]

Confirming we got it right:

In [ ]:
# the viz
base=brazil5880.plot(color='yellow') # brazil on the back
airports_brazil5880.cx[:,mid_y:].plot(ax=base,markersize=1) # all the brazilian northern airports 
brazil5880.centroid.plot(color='red',ax=base) # check the centroid

When the GDF is made of points, cx gives clean results. Do not expect that when the GDF is made of polygons.

Let me use it to split the states:

In [ ]:
# the north
N_brazil=states_brazil5880.cx[:,mid_y:]
# the south
S_brazil=states_brazil5880.cx[:,:mid_y]
# the west
W_brazil=states_brazil5880.cx[:mid_x,:]
# the east
E_brazil=states_brazil5880.cx[mid_x:,:]

Notice that cx does not cut polygons at the centroid; polygons crossing the line are kept whole:

In [ ]:
base=N_brazil.plot()
brazil5880.centroid.plot(color='red',ax=base)
In [ ]:
base=W_brazil.plot()
brazil5880.centroid.plot(color='red',ax=base)

d. Filtering by attribute value¶

The GDF 'states_brazil5880' has only these columns:

In [ ]:
states_brazil5880.columns

But since it is projected, you can get the area of each polygon:

In [ ]:
# area is in m²; this filter finds states larger than 1,000,000 km² (1e12 m²)
states_brazil5880[states_brazil5880.area > 1000000000000]

II. Combining geometries¶

Let's remember these contents:

In [ ]:
#see
municipalities_brazil5880.head(20)

Then, this is Rondônia:

In [ ]:
muniRondonia=municipalities_brazil5880[municipalities_brazil5880.state_name=='Rondônia']
In [ ]:
muniRondonia.plot(edgecolor='yellow')

We just saw several polygons; combining them means they become one polygon. There is more than one way to achieve that.

II.1 Unary UNION¶

We can combine all these polygons into one:

In [ ]:
muniRondonia.union_all()

Let's save that result:

In [ ]:
Rondonia_union=muniRondonia.union_all()
In [ ]:
# what do we have?
type(Rondonia_union)

You can turn that shapely object into a GeoDF like this:

In [ ]:
gpd.GeoDataFrame(geometry=[Rondonia_union]) # the recent union

Even better:

In [ ]:
gpd.GeoDataFrame(index=[0], # one element
                 data={'state':'Rondonia'}, # the column and the value
                 crs=muniRondonia.crs, # important to avoid naive geometries
                 geometry=[Rondonia_union]) # the recent union

II.2 Dissolve¶

a. Dissolve as Union¶

Using dissolve is an alternative to UNION:

In [ ]:
muniRondonia.dissolve()

Let me save the result and see its type:

In [ ]:
Rondonia_dissolved=muniRondonia.dissolve()

# we got?
type(Rondonia_dissolved)

You got a GDF this time:

In [ ]:
## see
Rondonia_dissolved

Some minimal changes to Rondonia_dissolved:

In [ ]:
# keeping what is relevant
Rondonia_dissolved.drop(columns=['municipality_name','municipality_code'],inplace=True)

# then
Rondonia_dissolved

Notice that dissolving returns a GDF and also keeps the information of one of the lower-level units that were dissolved. The union returned just a geometry.
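
A quick sanity check as a minimal sketch, using the objects above: dissolve() keeps the CRS of the input automatically, while the GDF we built from union_all() only has a CRS because we passed crs=muniRondonia.crs explicitly.

In [ ]:
# dissolve() preserves the CRS of the input GDF; union_all() returns a bare
# shapely geometry, so any GDF built from it needs its CRS set by hand
Rondonia_dissolved.crs == muniRondonia.crs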

b. Dissolve for groups¶

Using dissolve() with no arguments returns the union of the polygons as above, and you also get a GDF. However, if you have a column that represents a grouping (as we do), you can dissolve by that column:

In [ ]:
# dissolving municipalities again!- but by state
municipalities_brazil5880.dissolve(by='state_name').plot(facecolor='lightgrey', edgecolor='black',linewidth=0.2)

Again, let me save this result:

In [ ]:
Brazil_munitoStates=municipalities_brazil5880.dissolve(by='state_name')

We know we have a GDF; let's see contents:

In [ ]:
Brazil_munitoStates.head()

Again, we can drop columns that came from the lower level:

In [ ]:
Brazil_munitoStates.drop(columns=['municipality_name','municipality_code'],inplace=True)
Brazil_munitoStates.reset_index(inplace=True)
Brazil_munitoStates.info()

c. Dissolve and aggregate¶

In pandas, you can aggregate data using some statistics. Let's see the map with indicators we created in a previous session:

In [ ]:
indicators.head()

You can compute the mean of the countries by region, using a DF approach like this:

In [ ]:
indicators.groupby('region').agg({'fragility':'mean'}) 

You do not see a "geometry" column. It got lost when using groupby().agg().

The appropriate operation to conserve spatial information is also dissolve:

In [ ]:
indicatorsByRegion=indicators.dissolve(
    by="region", #groupby()
    aggfunc={"fragility": "mean"}, #agg()
    )

## see the GeoDF
indicatorsByRegion

Without renaming, you can request a choropleth:

In [ ]:
# You may need to install if using Colab
# !pip install mapclassify
In [ ]:
indicatorsByRegion.plot(column ='fragility',edgecolor='white',
                        figsize=(15, 10))

Keep in mind that combining objects via UNION_ALL and DISSOLVE is destructive: we cannot undo it. Operations like EXPLODE work in the reverse direction (splitting), but even explode cannot recover the original lower-level geometries from the output of UNION_ALL or DISSOLVE, so always keep a copy of the original GDF (see the sketch below). Did you notice something wrong in this last plot?
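
Before moving on, here is a minimal sketch with the GDFs defined above showing what explode() can and cannot do: it splits each multi-part geometry into its pieces, but those pieces are not the original municipalities.

In [ ]:
# explode() turns each MultiPolygon row into several single-Polygon rows...
exploded = Brazil_munitoStates.explode(index_parts=False)
# ...but the municipal borders are gone: compare the row counts
exploded.shape[0], municipalities_brazil5880.shape[0]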


III. Enveloping geometries: the convex hull¶

Sometimes you need to create a polygon that serves as an envelope for a set of points.

For this example, let me use the large airports:

In [ ]:
large_airports=airports_brazil5880[airports_brazil5880.airport_type=='large_airport']
large_airports.plot()

How to create a minimum polygon that envelops those points?

In [ ]:
## you see no difference!!
large_airports.convex_hull.plot()

The objects to be enveloped need to be combined first:

In [ ]:
# hull of the union
large_airports.union_all().convex_hull

The structure we got is:

In [ ]:
# this geometry is not a GeoDF...yet
type(large_airports.union_all().convex_hull)

Let's turn this geometry into a GDF:

In [ ]:
LargeAirports_hull= gpd.GeoDataFrame(index=[0],
                                     data={'hull':'Large airports'}, # the column and the value
                                    crs=large_airports.crs,
                                    geometry=[large_airports.union_all().convex_hull])

# then

LargeAirports_hull

Let's use the GDF in plotting:

In [ ]:
base=brazil5880.plot(facecolor='yellow')
large_airports.plot(ax=base)
LargeAirports_hull.plot(ax=base,facecolor='green',
                       edgecolor='white',alpha=0.4,
                       hatch='X')

You can get a convex hull of lines or polygons:

In [ ]:
# You can use it for dissolved polygons:
Rondonia_dissolved.convex_hull.plot()

Remember that union_all and dissolve() give different outputs:

In [ ]:
# you got a series, not just a geometry 
type(Rondonia_dissolved.convex_hull)
In [ ]:
# a simple "to_frame" does the job
Rondonia_dissolved.convex_hull.to_frame()

Here, we will turn that GeoSeries into a GDF:

In [ ]:
# more details
Rondonia_hull=Rondonia_dissolved.convex_hull.to_frame()
Rondonia_hull.rename(columns={0:"geometry"},inplace=True)
Rondonia_hull.set_geometry('geometry',inplace=True)
Rondonia_hull["name"]="Rondonia"
Rondonia_hull
In [ ]:
# notice the crs was inherited
Rondonia_hull.crs

Unless you need one hull per row, you first need to union/dissolve the polygons (rows) of a GeoDF. See:

In [ ]:
#original not COMBINED:
Brazil_munitoStates.plot(edgecolor="yellow")
In [ ]:
# hull of Non combined
Brazil_munitoStates.convex_hull.plot(edgecolor="yellow")
In [ ]:
# the hull of Brazil
Brazil_munitoStates.dissolve().convex_hull.plot(edgecolor="yellow")

IV. Buffering geometries¶

A buffer creates a polygon that follows the shape of the original vector (line, polygon, or point).

Let's see the Amazon River system:

In [ ]:
AmazonSystem=world_rivers[world_rivers.SYSTEM=='Amazon']
AmazonSystem.plot()

As this is not projected...

In [ ]:
AmazonSystem.crs.is_projected

We should reproject as buffering works with distances:

In [ ]:
AmazonSystem_5880=AmazonSystem.to_crs(5880)

Now I can use the rivers to create a buffer of 50000 meters:

In [ ]:
# 50000 m on each side (the radius)
AmazonSystem_5880.buffer(50000).plot(facecolor='yellow', edgecolor='black',linewidth=0.2)

The resulting buffer is:

In [ ]:
type(AmazonSystem_5880.buffer(50000))

Then:

In [ ]:
base=AmazonSystem_5880.buffer(50000).plot(facecolor='yellow',edgecolor='black',linewidth=0.2)
AmazonSystem_5880.plot(ax=base)

Notice that buffering can be customized:

In [ ]:
riv_buf_right = AmazonSystem_5880.buffer(distance = 50000, single_sided = True)
riv_buf_left = AmazonSystem_5880.buffer(distance = -25000, single_sided = True)

base =riv_buf_right.plot(color='green')
riv_buf_left.plot(ax=base, color='purple')

Let me save the reprojected rivers to a GeoJSON file:

In [ ]:
AmazonSystem_5880.to_file("AmazonSystem_5880.geojson", driver="GeoJSON")

BINARY Operations:¶

I. Distance¶

Distance is a key binary operation as so many practical policy matters depend on knowing distances between objects in space.

Any pair of properly projected GDFs has a distance between them. Below, we can build a query using distances:

Which are the airports whose distance to Brazil's centroid is greater than 2,500,000 m (2,500 km)?

In [ ]:
# this is the centroid we have:
brazil5880_cen,brazil5880_cen.iloc[0]

Then,

In [ ]:
airports_brazil5880[airports_brazil5880.distance(brazil5880_cen.iloc[0]) > 2500000]

The results can be confirmed visually:

In [ ]:
base=airports_brazil5880[airports_brazil5880.distance(brazil5880_cen.iloc[0]) > 2500000].plot(marker='+',markersize=100)
airports_brazil5880.plot(ax=base,color='grey', markersize=0.1)
brazil5880_cen.plot(ax=base,color='red')

Let me review how distances work between different kinds of geometries:

a. Distance between points¶

We have these points:

In [ ]:
seaports_brazil5880.head()
In [ ]:
large_airports.head()
In [ ]:
large_airports.distance(seaports_brazil5880.loc[0,'geometry'])

It is easy to compute distances, but a little manipulation of the GDF will help us better understand the result:

In [ ]:
# airport names as the index instead of row numbers:
large_airports.set_index('airport_name',inplace=True)
large_airports

Let's do the same for seaports:

In [ ]:
seaports_brazil5880.set_index('seaport_name',inplace=True)

Then, this is...

The distance from each large airport to 'Dtse / Gegua Oil Terminal' in km.

In [ ]:
large_airports.distance(seaports_brazil5880.geometry.iloc[0])/1000

What about computing...

All the distances between large airports and seaports (in km)

In [ ]:
# apply creates a LOOP, computes distances from each seaport to all large airports
seaports_brazil5880.geometry.apply\
(lambda seaport: large_airports.geometry.distance(seaport)/1000)

If we save the matrix...

In [ ]:
D_Matrix_sea_air=seaports_brazil5880.geometry.apply \
                (lambda seaport: large_airports.geometry.distance(seaport)/1000)

We can compute some distance stats from there:

In [ ]:
Stat_sea_air=pd.DataFrame()
Stat_sea_air['mean']=D_Matrix_sea_air.mean(axis=1) # mean D to all airports
Stat_sea_air['min']=D_Matrix_sea_air.min(axis=1)# min D to all airports
Stat_sea_air['max']=D_Matrix_sea_air.max(axis=1)# max D to all airports

# see some
Stat_sea_air.head(10)

Of course, idxmax and idxmin could come in handy:

In [ ]:
# farthest airport to each seaport
D_Matrix_sea_air.idxmax(axis=1).head()
In [ ]:
# farthest seaport to each airport
D_Matrix_sea_air.idxmax(axis=0).head()
In [ ]:
# closest airport to each seaport
D_Matrix_sea_air.idxmin(axis=1).head()
In [ ]:
# closest seaport to each airport
D_Matrix_sea_air.idxmin(axis=0).head()

b. Distance between line and point¶

Once we understand how distance and idxmin/idxmax work, we can feel comfortable at this stage.

Let's use these rivers from before:

In [ ]:
AmazonSystem_5880

Then,

Distance from river Tapajos to Guarulhos airport

In [ ]:
indexLabel='Guarulhos - Governador André Franco Montoro International Airport'
AmazonSystem_5880[AmazonSystem_5880.RIVER=='Tapajos'].distance(large_airports.geometry.loc[indexLabel])/1000

Reformat the Amazon river GDF:

In [ ]:
AmazonSystem_5880.set_index('RIVER',inplace=True)

We can compute the distance matrix now:

In [ ]:
D_Matrix_amazRivs_air=AmazonSystem_5880.geometry.apply \
                (lambda river: large_airports.geometry.distance(river)/1000)
In [ ]:
Stat_amz_air=pd.DataFrame()
Stat_amz_air['mean']=D_Matrix_amazRivs_air.mean(axis=1) # mean D to all airports
Stat_amz_air['min']=D_Matrix_amazRivs_air.min(axis=1)# min D to all airports
Stat_amz_air['max']=D_Matrix_amazRivs_air.max(axis=1)# max D to all airports

# see some
Stat_amz_air.head(10)
In [ ]:
# closest river to each airport
D_Matrix_amazRivs_air.idxmin(axis=0).head()
In [ ]:
# farthest river to each airport
D_Matrix_amazRivs_air.idxmax(axis=0).head()

c. Between Polygon and Point¶

Let me reuse the world rivers:

In [ ]:
river_systems=world_rivers[world_rivers.SYSTEM.isin(['Amazon','Parana'])]
river_systems
In [ ]:
ama_para=river_systems.dissolve(by='SYSTEM')
ama_para.drop(columns='RIVER',inplace=True)
ama_para

We still have lines:

In [ ]:
ama_para.plot(cmap='viridis')

But we will have polygons after this:

In [ ]:
ama_para.convex_hull.plot(cmap='viridis')

Since we have a GeoSeries of two geometries...

In [ ]:
ama_para.convex_hull,type(ama_para.convex_hull)

Let's turn that into a GDF:

In [ ]:
ama_para_hulls=ama_para.convex_hull.to_frame()
ama_para_hulls.rename(columns={0:'geometry'},inplace=True)
ama_para_hulls=ama_para_hulls.set_geometry('geometry')
ama_para_hulls.crs="EPSG:5880"

#voila
ama_para_hulls

And now, the distance matrix:

In [ ]:
D_Matrix_rivsHulls_air=ama_para_hulls.geometry.apply \
                (lambda system: large_airports.geometry.distance(system)/1000)
D_Matrix_rivsHulls_air

From here, you can compute distances between other kinds of geometries.
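
For instance, here is a minimal sketch (reusing objects built above) of a polygon-to-polygon distance matrix: each river-system hull against each dissolved state, where a value of 0 means the two overlap or touch.

In [ ]:
# polygon-to-polygon sketch: distance (km) from each river-system hull to each state
states_by_name=Brazil_munitoStates.set_index('state_name')
ama_para_hulls.geometry.apply(lambda hull: states_by_name.geometry.distance(hull)/1000)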

II. Clipping¶

Clipping uses a GDF geometry as a MASK to cut another GDF, which is presumably bigger and needs to be clipped.

Pay attention to the world rivers again:

In [ ]:
world_rivers

As you see, this GDF has no country column. But since it has geometry, you can keep the rivers, or the sections of them, that run through a country:

In [ ]:
rivers_brazil5880 = gpd.clip(gdf=world_rivers.to_crs(5880),
                             mask=brazil5880)

Then, you can plot the clipped version:

In [ ]:
base = brazil5880.plot(facecolor="greenyellow", edgecolor='black', linewidth=0.4,figsize=(5,5))
rivers_brazil5880.plot(edgecolor='blue', linewidth=0.5,
                    ax=base)

We can create our own mask for clipping:

Let me get the bounding box of the map (the smallest possible rectangle that completely encloses a geometric shape or set of shapes):

In [ ]:
brazil5880.total_bounds #[minx, miny, maxx, maxy]
In [ ]:
# or
minx, miny, maxx, maxy=brazil5880.total_bounds
minx, miny, maxx, maxy

I will combine those coordinates with the centroid (mid_x, mid_y) to create a BOX for the north and the south of Brazil:

In [ ]:
north_mask = [minx, mid_y, maxx, maxy]
south_mask = [minx, miny, maxx, mid_y]

# split Brazil
states_brazil5880.clip(north_mask).plot(edgecolor="yellow")
In [ ]:
states_brazil5880.clip(south_mask).plot(edgecolor="yellow")

As you see, with clip we can cut polygons (state borders are not respected).
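
A quick check of that claim, as a minimal sketch using objects defined earlier: cx keeps whole polygons, while clip cuts them at the mask, so the total areas of the two 'northern' selections differ.

In [ ]:
# clip cuts the states exactly at the mask, so its total area is smaller than
# the cx selection (which keeps the straddling states whole)
states_brazil5880.clip(north_mask).area.sum(), N_brazil.area.sum()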

III. Spatial Joins¶

We’re familiar with merging, which joins tables using common keys. Spatial joins, by contrast, rely solely on geometry columns to perform various types of filtering.

Let me use the Brazilian large airports and states:

In [ ]:
large_airports.head()

...and:

In [ ]:
states_brazil5880.head()

a. Within¶

Let's ask:

The large airports whose geometries are within the borders of a state in Brazil.

In [ ]:
airports_within_states = gpd.sjoin(
    large_airports,         # LEFT: airports we want to filter/keep
    states_brazil5880,      # RIGHT: spatial boundaries to check against
    how='inner',            # return geometries that match in both LEFT/RIGHT (jointype)
    predicate='within'      # spatial condition: LEFT geometry within RIGHT geometry
)

# these are:
airports_within_states

We just performed a point-to-polygon spatial join. Notice that the result preserves the original geometries from the LEFT GeoDataFrame — specifically, only those features whose spatial relationship satisfied both the predicate (e.g., 'within') and the join type ('inner'). The non-geometric attributes (columns) from the RIGHT GeoDataFrame are joined to the matching rows.
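
A quick check of that description, sketched on the result above: the geometries are still the airport points, and the state attributes arrived as extra columns (plus an index_right column pointing to the matching state row).

In [ ]:
# the geometry column still holds points (they come from the LEFT GDF)...
print(airports_within_states.geom_type.value_counts())
# ...and the state columns (plus index_right) were appended from the RIGHT GDF
print(airports_within_states.columns.to_list())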

b. Contains¶

Importantly, if the LEFT GeoDataFrame contains polygons and the RIGHT contains points (a polygon-to-point join), you’ll typically need to use a different predicate — such as 'contains' — to express the spatial relationship correctly.

Brazilian states that house large airports

In [ ]:
states_containing_LargeAirports = gpd.sjoin(states_brazil5880,large_airports,how='inner',
                                            predicate='contains')

states_containing_LargeAirports

c. Intersects¶

'Contains' is literally strict: Any airport located exactly on a state boundary — whether due to data precision, snapping, or real geography — will be excluded, even if it’s “practically” inside the state. More flexibility is achieved with intersects.

Large airports whose location shares at least one point with a particular state

In [ ]:
## Intersects needs at least a common point between both GeoDFs. 
gpd.sjoin(states_brazil5880,large_airports,
          how='inner', predicate='intersects')

d. Touches¶

We also have 'touches', a more stringent predicate than 'intersects'. It returns geometries that:

  • Share a border (for polygons or lines), or
  • Contact at exactly one point (for points or endpoints).

Which states are neighbors of 'Bahia', including Bahia itself?

In [ ]:
# Neighbors of Bahia?
gpd.sjoin(N_brazil.loc[N_brazil.state_name=='Bahia',:],N_brazil,how='inner', predicate='intersects').shape

That is, Bahia seems to share borders with 5 states:

In [ ]:
base=gpd.sjoin(N_brazil,N_brazil.loc[N_brazil.state_name=='Bahia',:],
               how='inner', 
               predicate='intersects').plot(color='yellow',edgecolor='red')
N_brazil.loc[N_brazil.state_name=='Bahia',:].plot(ax=base, color='red')
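
We can also list those neighbors by name; a quick sketch that drops Bahia itself from the result:

In [ ]:
# list the intersecting states by name, excluding Bahia itself
neighbors=gpd.sjoin(N_brazil.loc[N_brazil.state_name=='Bahia',:],N_brazil,
                    how='inner', predicate='intersects')
sorted(set(neighbors.state_name_right)-{'Bahia'})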

However, because many free GeoDataFrames — especially those sourced as Shapefiles — contain topological imperfections like gaps, overlaps, or misaligned vertices, 'touches' often fails to detect what should be adjacent features. Ironically, this “failure” can be useful: 'touches' acts as a diagnostic tool — highlighting where boundaries are not perfectly aligned.

In [ ]:
gpd.sjoin(N_brazil.loc[N_brazil.state_name=='Bahia',:],N_brazil,how='inner', predicate='touches').shape

See the neighbor that disappears:

In [ ]:
base=gpd.sjoin(N_brazil,N_brazil.loc[N_brazil.state_name=='Bahia',:],
               how='inner', 
               predicate='touches').plot(color='yellow',edgecolor='red')
N_brazil.loc[N_brazil.state_name=='Bahia',:].plot(ax=base, color='red')

e. Crosses¶

When we have lines, we may need crosses. Let me subset our rivers:

In [ ]:
amazonSystem=rivers_brazil5880[rivers_brazil5880.SYSTEM=='Amazon']
amazonSystem

Then,

Which rivers from the Amazon system are intersecting states?

In [ ]:
gpd.sjoin(amazonSystem,states_brazil5880,how='inner', predicate='intersects').shape

Alternatively,

Which rivers from the Amazon system are crossing states?

In [ ]:
gpd.sjoin(amazonSystem,states_brazil5880,how='inner', predicate='crosses').shape

Again, intersects means both geometries have some 'space' in common, while crosses requires the geometry to actually cross the other object's border. From the results above, there is one river that shares space with a state but does not cross its border.

In [ ]:
# Get intersects result
intersects_result = gpd.sjoin(amazonSystem,states_brazil5880, how='inner', predicate='intersects')

# Get crosses result
crosses_result = gpd.sjoin(amazonSystem,states_brazil5880, how='inner', predicate='crosses')

# Find indexes/columns
riverIndex_notCrossing=list(set(intersects_result.index)-set(crosses_result.index))
stateIndex_notCrossed=intersects_result[intersects_result.index.isin(riverIndex_notCrossing)].index_right

# see
states_brazil5880.loc[stateIndex_notCrossed,"state_name"], amazonSystem.loc[riverIndex_notCrossing,"RIVER"]

Now we know which river is not crossing a state border, and the name of that state.

In [ ]:
base=states_brazil5880.loc[stateIndex_notCrossed,:].plot(color='w',edgecolor='k',figsize=(12, 8))
amazonSystem.plot(ax=base)
amazonSystem.loc[riverIndex_notCrossing,:].plot(color='red',ax=base)

IV. Spatial Overlay¶

As the name implies, you need two inputs. We may need to create or derive geometries from the geometries we already have. Using a set-theory approach, we will see the use of intersection, union, difference, and symmetric difference. Let's remember these results:

In [ ]:
N_brazil
In [ ]:
S_brazil

Let me plot both of them:

In [ ]:
base= N_brazil.plot(facecolor='black', edgecolor='white',linewidth=0.2, alpha=0.6)
S_brazil.plot(facecolor='white', edgecolor='black',linewidth=0.2,ax=base, alpha=0.6)

The grey area reminds you that the coordinates we used to split the states did not give us a clean cut. Here you see the states in common:

In [ ]:
set(S_brazil.state_name) & set(N_brazil.state_name)

The same happened in East vs West:

In [ ]:
set(E_brazil.state_name) & set(W_brazil.state_name)
In [ ]:
# visualizing
base= E_brazil.plot(facecolor='black', edgecolor='white',linewidth=0.2, alpha=0.6)
W_brazil.plot(facecolor='white', edgecolor='black',linewidth=0.2,ax=base, alpha=0.6)

Let's play with these GDFs; keep in mind the position of the GDFs:

GDF_left.overlay(GDF_right, ...)

a. Intersection¶

We keep what is common between left and right GDFs :

In [ ]:
NS_brazil=N_brazil.overlay(S_brazil, how="intersection",keep_geom_type=True)
# see results
NS_brazil

Notice we got more rows than when we did this operation:

set(S_brazil.state_name) & set(N_brazil.state_name)

We have three more polygons:

In [ ]:
NS_brazil[NS_brazil.state_name_1!= NS_brazil.state_name_2]

In fact, we are NOT intersecting state names, we are intersecting geometries. So the input maps have some topological issues.

This is the amount of area that is in fact a topological problem:

In [ ]:
NS_brazil[NS_brazil.state_name_1!= NS_brazil.state_name_2].geometry.area.sum()

One way to measure the share of this low-quality area:

In [ ]:
NS_brazil[NS_brazil.state_name_1!= NS_brazil.state_name_2].geometry.area.sum()/  \
NS_brazil[NS_brazil.state_name_1== NS_brazil.state_name_2].geometry.area.sum() #continues from above

So, spatial overlay operations do their best to give you true results; but since the quality of the sources is not perfect, you may get messy output. It is our job to detect these issues and make decisions. Let's keep two GDFs: one with the imperfect result and another with a 'valid' output.

In [ ]:
NS_brazil_messy=NS_brazil.copy()
NS_brazil=NS_brazil[NS_brazil.state_name_1== NS_brazil.state_name_2]

This should be what we expected to see:

In [ ]:
NS_brazil

The clean data has minor things to improve: delete redundant columns, rename columns, and reset the index so it is a consecutive sequence.

In [ ]:
# avoid redundancy
keep=['state_name_1','state_code_1','geometry']
NS_brazil=NS_brazil.loc[:,keep]
NS_brazil.rename(columns={'state_name_1':'state_name','state_code_1':'state_code'},inplace=True)

# reset for correlative sequence
NS_brazil.reset_index(drop=True, inplace=True)

Based on the previous case, we may expect a similar situation here:

In [ ]:
# keeping the overlay
WE_brazil=W_brazil.overlay(E_brazil, how="intersection",keep_geom_type=True)
WE_brazil[WE_brazil.state_name_1!= WE_brazil.state_name_2]

Let's do the same as before:

In [ ]:
WE_brazil_messy=WE_brazil.copy()
WE_brazil=WE_brazil[WE_brazil.state_name_1== WE_brazil.state_name_2]

keep=['state_name_1','state_code_1','geometry']
WE_brazil=WE_brazil.loc[:,keep]
WE_brazil.rename(columns={'state_name_1':'state_name','state_code_1':'state_code'},inplace=True)
WE_brazil.reset_index(drop=True, inplace=True)

b. Union¶

Unlike UNION_ALL (which acts like DISSOLVE), here we will combine the left and right GDFs.

In [ ]:
NS_brazil.info()
In [ ]:
WE_brazil.info()
In [ ]:
# now
NS_brazil.overlay(WE_brazil,how="union",keep_geom_type=True)

As you see, the geometries are fine, but missing values were created in the rows where no intersection exists. Notice that this operation does identify the intersection.
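
A minimal sketch to see those missing values: count the NaNs created on each side of the union overlay.

In [ ]:
# rows coming from only one side of the union have NaNs in the other side's columns
union_NSWE=NS_brazil.overlay(WE_brazil,how="union",keep_geom_type=True)
union_NSWE[['state_name_1','state_name_2']].isna().sum()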

Pandas has a concat() function, which just pastes two tables, one on top of the other. Notice this gives you one extra row (Mato Grosso appears twice) because nothing is done even though the geometries intersect.

In [ ]:
# appending
import pandas as pd

pd.concat([NS_brazil,WE_brazil],ignore_index=True)

Let me create an object to save the overlay result:

In [ ]:
MidBrazil=NS_brazil.overlay(WE_brazil,how="union",keep_geom_type=True).dissolve()
MidBrazil
In [ ]:
# some cleaning

MidBrazil['country']='Brazil'
MidBrazil['region']='center'
# reordering
MidBrazil=MidBrazil.loc[:,['country','region','geometry']]

MidBrazil
In [ ]:
# see it
base=brazil5880.plot(facecolor='yellow')
MidBrazil.plot(ax=base)

c. Difference¶

Here, you keep what belongs to the left GDF that is not in the right GDF:

In [ ]:
# we keep the northern states that are not in the southern ones
N_brazil.overlay(S_brazil, how='difference')
In [ ]:
# using set operations:
set(N_brazil.state_name)- set(S_brazil.state_name)

We got a clean result (remember we are using the non-messy GDFs). Let's plot this:

In [ ]:
base=N_brazil.plot(color='yellow', edgecolor='black',alpha=0.1)
N_brazil.overlay(S_brazil, how='difference').plot(ax=base)

Keep in mind that difference is not commutative:

In [ ]:
S_brazil.overlay(N_brazil, how='difference')

This is a totally different result:

In [ ]:
base=N_brazil.plot(color='yellow', edgecolor='black',alpha=0.1)
S_brazil.overlay(N_brazil, how='difference').plot(ax=base)

d. Symmetric Difference¶

Here, we keep what belongs to either GDF but not to their intersection. Notice that this operation is commutative!

In [ ]:
N_brazil.overlay(S_brazil, how='symmetric_difference')

This operation gave a clean result again. Let's plot it:

In [ ]:
N_brazil.overlay(S_brazil, how='symmetric_difference').plot()
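
Since symmetric difference is commutative, swapping the inputs should cover the same area; a minimal sketch to verify:

In [ ]:
# total areas should match (up to floating-point noise) regardless of the input order
N_brazil.overlay(S_brazil, how='symmetric_difference').area.sum(), \
S_brazil.overlay(N_brazil, how='symmetric_difference').area.sum()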

Validity of Geometries¶

Geometries are sometimes created with issues, especially (multi)polygons. Let's check whether our recent maps of states and municipalities are valid:

In [ ]:
# non valid
S_brazil[~S_brazil.is_valid]
In [ ]:
# see the invalid:
S_brazil[~S_brazil.is_valid].plot()

It is difficult to see what is wrong. Let's get some information:

In [ ]:
# what is wrong?

from shapely.validation import explain_validity, make_valid

explain_validity(S_brazil[~S_brazil.is_valid].geometry)

This is the report:

In [ ]:
explain_validity(S_brazil.geometry).str.split("[",expand=True)[0].value_counts()

Let's use make_valid:

In [ ]:
S_brazil_valid=S_brazil.copy()

S_brazil_valid['geometry'] = [make_valid(row)  if not row.is_valid else row for row in S_brazil['geometry'] ]

#any invalid?
S_brazil_valid[~S_brazil_valid.is_valid]

The use of make_valid may output geometry collections, which is not desirable:

In [ ]:
pd.Series([type(x) for x in S_brazil_valid.geometry]).value_counts()
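
If collections do appear, one common workaround (a sketch, not the only option) is to keep only the polygonal parts of each GeometryCollection:

In [ ]:
from shapely.ops import unary_union

def keep_polygons(geom):
    # keep only the polygonal pieces of a GeometryCollection; other geometries pass through
    if geom.geom_type=='GeometryCollection':
        polys=[g for g in geom.geoms if g.geom_type in ('Polygon','MultiPolygon')]
        return unary_union(polys) if polys else geom
    return geom

S_brazil_valid['geometry']=S_brazil_valid.geometry.apply(keep_polygons)
pd.Series([type(x) for x in S_brazil_valid.geometry]).value_counts()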

Buffers and Validity¶

The buffering process helps clean simple invalidities:

In [ ]:
S_brazil_valid=S_brazil.copy()

S_brazil_valid['geometry'] = S_brazil_valid['geometry'].buffer(0)

#any invalid?
S_brazil_valid[~S_brazil_valid.is_valid]

This 'buffer trick' may not always work:

In [ ]:
# previously
indicatorsByRegion.plot(column =indicatorsByRegion.index,
                        edgecolor='white',
                        figsize=(15, 10))

The worst cases seem to be AFRICA and EAST AND SOUTHEAST ASIA, as both show some lines that should have disappeared after the dissolving we did a while ago.

Did the dissolving process create invalid geometries?

In [ ]:
indicatorsByRegion.geometry.is_valid.value_counts()

Since we do not have invalid geometries, the problem must be tiny gaps left between the dissolved borders, so the goal is to snap the boundaries together and eliminate these microscopic gaps.

We could try the trick of buffer(0), again:

In [ ]:
indicatorsByRegion_prjd=indicatorsByRegion.to_crs("ESRI:54052").copy()
indicatorsByRegion_prjd['geometry'] = indicatorsByRegion_prjd.buffer(0)

# previously
indicatorsByRegion_prjd.plot(column =indicatorsByRegion_prjd.index,
                        edgecolor='white',
                        figsize=(15, 10))

It did not work either. We may increase the buffer:

In [ ]:
indicatorsByRegion_prjd['geometry'] = indicatorsByRegion_prjd.buffer(1)

indicatorsByRegion_prjd.plot(column =indicatorsByRegion_prjd.index,
                        edgecolor='white',
                        figsize=(15, 10))

The last version did get rid of the gaps, at least visually. Let's check the part counts in each case:

In [ ]:
[(r,len(g.geoms)) for r,g in zip(indicatorsByRegion.index,indicatorsByRegion.geometry) if g.geom_type.startswith('Multi')]
In [ ]:
[(r,len(g.geoms)) for r,g in zip(indicatorsByRegion_prjd.index,indicatorsByRegion_prjd.geometry)  if g.geom_type.startswith('Multi')]

It seems the AFRICA issue was solved, but not EAST AND SOUTHEAST ASIA. There seems to be a really big issue along those borders (Mongolia and China). Let's explore:

In [ ]:
china=indicators[indicators.Country.isin(['CHINA'])]
mongolia=indicators[indicators.Country.isin(['MONGOLIA'])]

china.overlay(mongolia, how='intersection',keep_geom_type=False).geometry

So, we have a really bad situation:

  • There is an intersection between two countries, when there should be none.
  • The intersection includes objects other than polygons: a 'GEOMETRYCOLLECTION'.

See:

In [ ]:
# Quick count of objects in the GeometryCollection
result_geom = china.overlay(mongolia, how='intersection',keep_geom_type=False).geometry.iloc[0]
if result_geom.geom_type == 'GeometryCollection':
    print(f"Objects in collection: {len(result_geom.geoms)}")
    from collections import Counter
    print(dict(Counter(g.geom_type for g in result_geom.geoms)))
In [ ]:
## see the intersection:
base=china.plot(color='lightgrey')
mongolia.plot(color='yellow',ax=base)
china.overlay(mongolia, how='intersection',keep_geom_type=False).plot(ax=base)
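
As a small diagnostic only (a sketch, assuming the indicators layer carries a geographic CRS such as EPSG:4326), we can measure how much area that spurious overlap represents by keeping only the polygonal parts of the GeometryCollection:

In [ ]:
# keep only the polygonal pieces of the overlap and measure their area in km²
overlap_polys=[g for g in result_geom.geoms if g.geom_type in ('Polygon','MultiPolygon')]
gpd.GeoSeries(overlap_polys, crs=indicators.crs).to_crs("ESRI:54052").area.sum()/1e6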

The solution to this, believe me, will not be trivial: the border is not continuous, and creating a 'new frontier' between China and Mongolia would demand more functions than the ones we have covered so far. Situations like this require smart decisions, like getting a new map of better quality.