
The GeoDataFrame¶
The GeoDataFrame (GDF) is a DataFrame (DF) in which every row represents a spatial element (point, line, or polygon).
The most common file type that stores spatial elements is the shapefile. Let's take a look at some of them:
- In GitHub (cloud), create a repository named: introgeodf.
- Clone that repo to a local folder in your computer.
- In that local folder in your computer, create a folder named maps.
- Go to Paideia and download three compressed files from the folder WorldMaps.
You may see something like this:

You can decompress those files:

Now, take a look at World_Countries:

There you see that this single map requires several files; that is the nature of the shapefile format.
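A quick way to see those companion files, assuming the maps/World_Countries folder from the steps above exists:

```python
import os

# list the companion files that together form the shapefile
shp_dir = os.path.join("maps", "World_Countries")
files = sorted(os.listdir(shp_dir)) if os.path.isdir(shp_dir) else []
print(files)  # expect .shp, .shx, .dbf, .prj, and friends
```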
Let's read the file with the help of geopandas:
import os, geopandas as gpd
countries=gpd.read_file(os.path.join("maps","World_Countries","World_Countries.shp"))
Let's use some familiar DF functions:
# what is it?
type(countries)
# dimensions
countries.shape
# names
countries.columns
# some content
countries.head()
# what geometry?
countries.geom_type.unique()
# any missing values?
countries[countries.isna().any(axis=1)]
# types
countries.info()
As you see, those pandas commands work fine, but now we have a new column type: geometry. Let's see this map of countries:
countries.plot(facecolor="azure", # color of polygon fill
               edgecolor='red',   # color of lines
               linewidth=0.1)     # thickness of lines
Let's open the other maps:
rivers=gpd.read_file(os.path.join("maps","World_Hydrography","World_Hydrography.shp"))
cities=gpd.read_file(os.path.join("maps","World_Cities","World_Cities.shp"))
# what geo?
rivers.geom_type.unique(), cities.geom_type.unique()
This is the rivers map:
rivers.plot(edgecolor='blue',
            linewidth=1,
            linestyle='dotted')
This is the cities map:
cities.plot(marker='.', # marker type
            color='red',
            markersize=1,
            alpha=0.3)  # transparency
You can start by creating the bottom layer (the base), and then add layers on top:
base = countries.plot(facecolor="white",
                      edgecolor='black',
                      linewidth=0.1,
                      figsize=(12,12))
rivers.plot(edgecolor='blue', linewidth=0.4,
            ax=base) # on top of the base
cities.plot(marker='.', color='red', markersize=1, alpha=0.7,
            ax=base) # on top of both
Saving into a different format (not shapefile):
# ONE file - SEVERAL layers
import os
countries.to_file(os.path.join("maps","worldMaps.gpkg"), layer='countries', driver="GPKG")
rivers.to_file(os.path.join("maps","worldMaps.gpkg"), layer='rivers', driver="GPKG")
cities.to_file(os.path.join("maps","worldMaps.gpkg"), layer='cities', driver="GPKG")
Map Projection¶
The CRS (coordinate reference system) is a very important property of a map. It affects several aspects:
- shape
- area
- scale
- direction
Most maps come with a default CRS: EPSG:4326 (unprojected, in degrees). Pay attention:
brazil=countries[countries.COUNTRY=='Brazil']
brazil.crs
# check units of measurement
brazil.crs.axis_info
# is this CRS projected?
brazil.crs.is_projected
Polygons have a centroid. When we try getting the centroid of an unprojected polygon, we get a warning:
# centroid
brazil.centroid
# recommended for Brazil (meters)
brazil.to_crs(5641).crs.axis_info
# now this works with no warning
brazil.to_crs(5641).centroid
# replotting:
base5641=brazil.to_crs(5641).plot()
brazil.to_crs(5641).centroid.plot(color='red',ax=base5641)
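Centroids are computed on planar coordinates, which is why a projected CRS matters. A minimal sketch with shapely (the geometry library underneath geopandas), using a toy square rather than the Brazil polygon:

```python
from shapely.geometry import Polygon

# centroid of a simple square in planar coordinates
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
print(square.centroid)  # the middle of the square
```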
Let's keep the projected version for all our maps:
cities_brazil_5641=cities[cities.COUNTRY=='Brazil'].to_crs(5641)
riversBrazil_clipped=gpd.clip(rivers,brazil)
brazil_5641=brazil.to_crs(5641)
rivers_brazil_5641=riversBrazil_clipped.to_crs(brazil_5641.crs)
## saving
import os
brazil_5641.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='country', driver="GPKG")
cities_brazil_5641.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='cities', driver="GPKG")
rivers_brazil_5641.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='rivers', driver="GPKG")
#brazil_5641.centroid.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='centroid', driver="GPKG")
Exercise 1¶
- Reproject your country's map layers.
- Plot the reprojected layers
- Save the reprojected layers as gpkg.
Creating Spatial data¶
You have a "data" folder in Paideia with a CSV file with information on the airports in Brazil. Create a similar "data" folder on your local computer, inside the current repo.
Let's open the CSV:
import pandas as pd
infoairports=pd.read_csv(os.path.join("data","br-airports.csv"))
# some rows
infoairports.iloc[[0,1,2,3,-4,-3,-2,-1],:] #head and tail
This needs some cleaning:
# bye first row
infoairports.drop(index=0,inplace=True)
infoairports.reset_index(drop=True, inplace=True)
infoairports.head()
# keep the columns needed
infoairports.columns
keep=['name','type','latitude_deg', 'longitude_deg','elevation_ft','region_name','municipality']
infoairports=infoairports.loc[:,keep]
infoairports.info()
Some formatting:
numericCols=['latitude_deg', 'longitude_deg','elevation_ft']
infoairports[numericCols]=infoairports.loc[:,numericCols].apply(lambda x:pd.to_numeric(x))
# now
infoairports.info()
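A hedged side note: if a column contains strings that cannot be parsed as numbers, `pd.to_numeric` raises an error by default; passing `errors='coerce'` turns those cells into NaN instead. A toy example (not the airports data):

```python
import pandas as pd

# 'n/a' cannot be parsed as a number, so coerce it to NaN
s = pd.Series(["12.5", "n/a", "7"])
print(pd.to_numeric(s, errors="coerce"))
```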
# let's plot
base = brazil_5641.plot(color='white', edgecolor='black') # projected base (meters)
infoairports.plot.scatter(x = 'longitude_deg', y = 'latitude_deg',ax=base) # degrees
Why is it wrong? The base map is projected (EPSG:5641, in meters), while the airport coordinates are still unprojected degrees.
airports=gpd.GeoDataFrame(data=infoairports.copy(),
                          geometry=gpd.points_from_xy(infoairports.longitude_deg,
                                                      infoairports.latitude_deg),
                          crs=brazil.crs.to_epsg()) # the coordinates were in degrees - unprojected
# let's plot
base = brazil_5641.plot(color='white', edgecolor='black')
airports.plot(ax=base)
#remember:
type(airports), type(infoairports)
Let's reproject!
airports_5641=airports.to_crs(5641)
## then
base = brazil_5641.plot(color='white', edgecolor='black')
airports_5641.plot(ax=base)
Remember you have the type of each airport:
airports_5641['type'].value_counts() # this will not work: airports.type.value_counts()
We may use that in the future. For now, just rename the type column, since its name clashes with a GeoDataFrame attribute.
airports_5641.rename(columns={'type':'kind'},inplace=True)
## adding the airports to GPKG
airports_5641.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='airports', driver="GPKG")
Formatting GeoSeries projections¶
You know brazil_5641 is a multipolygon:
brazil_5641
Sometimes you just need the border (lines):
brazil_5641.boundary
# This is just the borderline
brazil_5641.boundary.plot()
Always check the data type:
# does 'boundary' return a GDF?
type(brazil_5641.boundary)
Some operations in geopandas require GDF or GS. If you need a GDF instead of a GS:
# converting into GDF
brazil_5641.boundary.to_frame()
brazil_5641.boundary.to_frame().info()
Notice you get a very simple GDF, and you may want to add some information:
# conversion
brazil_border=brazil_5641.boundary.to_frame()
# new column (optional)
brazil_border['name']='Brazil'
# renaming the geometry column
brazil_border.rename(columns={0:'geometry'},inplace=True)
#setting the geometry (the name is not enough)
brazil_border = brazil_border.set_geometry("geometry")
# verifying:
brazil_border.crs
brazil_border
You can add this GDF as a layer.
Exercise 3¶
Check if your country is a polygon or multipolygon.
Recover just the boundaries of that country.
Turn the boundary into a GDF.
Maps Lacking CRS information¶
Reprojecting seems a simple process, but you might find some interesting cases.
Download the compressed file "Brazil_subnational". Unzip or decompress that file. Move the decompressed folder into your current maps folder.
Let's read the maps of states (adm1) and municipalities (adm2):
brazil_states=gpd.read_file(os.path.join("maps","bra_adm_ibge_2020_shp","bra_admbnda_adm1_ibge_2020.shp"))
brazil_municipalities=gpd.read_file(os.path.join("maps","bra_adm_ibge_2020_shp","bra_admbnda_adm2_ibge_2020.shp"))
They are maps, for sure:
type(brazil_states), type(brazil_municipalities)
brazil_states.geometry.head()
brazil_municipalities.geometry.head()
But, notice this:
brazil_states.crs, brazil_municipalities.crs
They do not have CRS information; however, they can still be plotted:
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=False, sharey=False, figsize=(12,12))
brazil_states.plot(ax=ax1, facecolor='lightgrey', edgecolor='black')
brazil_municipalities.plot(ax=ax2, facecolor='lightgrey', edgecolor='black',linewidth=0.2)
Since we are using CRS 5641 for Brazil, the initial strategy could be to reproject these maps to that projection:
## uncomment this to see the error message
# brazil_states.to_crs(5641)
Python says "Please set a crs on the object first". That means we need to know the actual CRS of the geometry.
From the plots above and the rows seen, we conclude the maps are unprojected (coordinates in degrees); then:
# set as unprojected
brazil_states.crs = "EPSG:4326"
brazil_municipalities.crs = "EPSG:4326"
Now, we can reproject:
brazil_states=brazil_states.to_crs(5641)
brazil_municipalities=brazil_municipalities.to_crs(5641)
brazil_states.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='states', driver="GPKG")
brazil_municipalities.to_file(os.path.join("maps","brazilMaps_5641.gpkg"), layer='municipalities', driver="GPKG")
Exercise 4¶
- Look for sub administrative divisions of your country
- Check all the CRSs of those divisions
- If you find one CRS is missing, fill the CRS with the right projection. If not, just state nothing is to be done.
Geo Merging¶
The countries map has no interesting information beyond the geometry.
countries.head()
Let's add some information to each country:
import pandas as pd
fragilityCiaLink="https://github.com/CienciaDeDatosEspacial/merging/raw/main/FragilityCia_isos.csv"
fragilityCia=pd.read_csv(fragilityCiaLink)
fragilityCia.head()
We want to add the fragilityCia data into the map. That is the merging process. For that, we need a common column; the country name is the option.
# to upper case.
countries['COUNTRY']=countries.COUNTRY.str.upper()
It is very unlikely that the names are written the same way in both sources. Verify:
onlyFragilCia=set(fragilityCia.Country)- set(countries.COUNTRY)
onlyMap=set(countries.COUNTRY)- set(fragilityCia.Country)
Check here:
onlyFragilCia
# and here
onlyMap
Fuzzy merging¶
Let's find similar names:
# !pip install thefuzz
from thefuzz import process
[(country, process.extractOne(country,onlyMap)) for country in sorted(onlyFragilCia)]
# keeping high scores
[(country, process.extractOne(country,onlyMap)) for country in sorted(onlyFragilCia)
if process.extractOne(country,onlyMap)[1]>=90]
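If thefuzz is not available, the standard library's difflib offers a rough equivalent of this fuzzy matching. A toy illustration (hypothetical names, not the actual country sets):

```python
from difflib import get_close_matches

# stdlib fuzzy matching: find the closest candidate above a similarity cutoff
candidates = ["BOLIVIA", "BRAZIL", "BELIZE"]
print(get_close_matches("BRASIL", candidates, n=1, cutoff=0.6))  # ['BRAZIL']
```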
Preparing a dict of changes:
# then:
try1={country: process.extractOne(country,onlyMap)[0] for country in sorted(onlyFragilCia)
if process.extractOne(country,onlyMap)[1]>=90}
try1
Making changes and updating:
fragilityCia.replace(to_replace={'Country':try1},inplace=True)
# updating
onlyFragilCia=set(fragilityCia.Country)- set(countries.COUNTRY)
onlyMap=set(countries.COUNTRY)- set(fragilityCia.Country)
# new matches
[(country, process.extractOne(country,onlyMap)) for country in sorted(onlyFragilCia)]
# some manual
countries[countries.COUNTRY.str.contains('LAO|SWA|KOR')]
manualChanges={'SWAZILAND':'ESWATINI','LAOS':"LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE)",'SOUTH KOREA':'KOREA (THE REPUBLIC OF)'}
countries.replace(to_replace={'COUNTRY':manualChanges},inplace=True)
# updating
onlyFragilCia=set(fragilityCia.Country)- set(countries.COUNTRY)
onlyMap=set(countries.COUNTRY)- set(fragilityCia.Country)
# new matches
[(country, process.extractOne(country,onlyMap)) for country in sorted(onlyFragilCia)]
# then:
try2={country: process.extractOne(country,onlyMap)[0] for country in sorted(onlyFragilCia)}
try2
# changing
fragilityCia.replace(to_replace={'Country':try2},inplace=True)
# new update
onlyFragilCia=set(fragilityCia.Country)- set(countries.COUNTRY)
onlyMap=set(countries.COUNTRY)- set(fragilityCia.Country)
# new matches
[(country, process.extractOne(country,onlyMap)) for country in sorted(onlyFragilCia)]
We cannot improve the matching any further.
Now, when you merge a GDF with a DF, the GDF has to be on the left:
theMapAndData=countries.merge(fragilityCia,left_on='COUNTRY', right_on='Country')
theMapAndData.drop(columns=['Country'],inplace=True) # no need for this column
# here it is (new map):
theMapAndData.info()
DataNames=['fragility', 'co2', 'ForestRev_gdp']
pd.melt(theMapAndData[DataNames])
import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(pd.melt(theMapAndData[DataNames]),
x="value", hue="variable",kind="kde",
log_scale=(False,False))
The variables are in different units; we should try a data rescaling strategy:
# !pip install -U scikit-learn
- StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
normalized_data = scaler.fit_transform(theMapAndData[DataNames])
sns.displot(pd.melt(pd.DataFrame(normalized_data,columns=DataNames)),
x="value", hue="variable",kind="kde",
log_scale=(False,False))
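Under the hood, StandardScaler applies the per-column z-score formula. A minimal numpy sketch of the idea (not scikit-learn's code):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
z = (x - x.mean()) / x.std()  # z-score: mean 0, standard deviation 1
print(z.mean(), z.std())
```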
- MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data=scaler.fit_transform(theMapAndData[DataNames])
sns.displot(pd.melt(pd.DataFrame(scaled_data,columns=DataNames)),
x="value", hue="variable",kind="kde",
log_scale=(False,False))
- RobustScaler:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
robScaled_data = scaler.fit_transform(theMapAndData[DataNames])
sns.displot(pd.melt(pd.DataFrame(robScaled_data,columns=DataNames)),
x="value", hue="variable",kind="kde",
log_scale=(False,False))
- QuantileTransformer:
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(n_quantiles=99, random_state=0,output_distribution='normal') #or 'uniform'
QtScaled_data = scaler.fit_transform(theMapAndData[DataNames])
sns.displot(pd.melt(pd.DataFrame(QtScaled_data,columns=DataNames)),
x="value", hue="variable",kind="kde",
log_scale=(False,False))
Let's keep the last one:
theMapAndData['fragility_Qt']=QtScaled_data[:,0]
Discretizing¶
I will keep the transformed column fragility_Qt. Now, I want to cut the data into classes. Please install numba before running the next code; also make sure you have pysal, mapclassify, and numpy installed:
! pip show numba mapclassify numpy
# !pip install mapclassify
Let me discretize fragility_Qt:
import mapclassify
import numpy as np
np.random.seed(12345) # so we all get the same results!
# let's try 5 intervals
K=5
theVar=theMapAndData.fragility_Qt
# same interval width, easy interpretation
ei5 = mapclassify.EqualInterval(theVar, k=K)
# interval width based on standard deviations; easy, but performs poorly with high skewness
msd = mapclassify.StdMean(theVar)
# interval width varies, counts per interval are close, not easy to grasp, repeated values complicate cuts
q5=mapclassify.Quantiles(theVar,k=K)
# based on similarity, good for multimodal data
mb5 = mapclassify.MaximumBreaks(theVar, k=K)
# based on similarity, good for skewed data
ht = mapclassify.HeadTailBreaks(theVar) # no K needed
# based on similarity, optimizer
fj5 = mapclassify.FisherJenks(theVar, k=K)
# based on similarity, optimizer
jc5 = mapclassify.JenksCaspall(theVar, k=K)
# based on similarity, optimizer
mp5 = mapclassify.MaxP(theVar, k=K)
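For intuition, EqualInterval simply splits the variable's range into k bins of the same width. A numpy sketch of that idea with toy values (not mapclassify's implementation):

```python
import numpy as np

vals = np.array([1, 2, 3, 10, 20, 30, 40, 50])
k = 5
edges = np.linspace(vals.min(), vals.max(), k + 1)   # k equal-width intervals
labels = np.digitize(vals, edges[1:-1], right=True)  # interval index per value
print(labels)
```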
How can we select the right classification? Let me use the absolute deviation around class medians (ADCM) to make the comparison:
class5 = ei5,msd, q5,mb5, ht, fj5, jc5, mp5
# Collect ADCM for each classifier
fits = np.array([ c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms['classifier'] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ['ADCM', 'Classifier']
Now, plot the adcms:
adcms.sort_values('ADCM').plot.barh(x='Classifier')
Let's save the best strategy:
theMapAndData['fragility_Qt_jc5'] = jc5.yb
# there you are
theMapAndData[['fragility_Qt','fragility_Qt_jc5']].head()
Let's check the mean of 'fragility_Qt' by the labels of the column created (from 0 to 4):
indexList=['fragility_Qt_jc5'] # add more?
aggregator={'fragility_Qt': ['mean']}
pd.concat([theMapAndData[['fragility_Qt',col]].groupby(col,as_index=False).agg(aggregator) for col in indexList],axis=1)
We could create a new column:
# renaming
newLabelsForLevels={0:"0_Great", 1:"1_Good", 2:"2_Middle", 3:"3_Bad", 4:"4_Poor"}
theMapAndData['fragility_Qt_jc5_cat']=theMapAndData.loc[:,'fragility_Qt_jc5'].replace(newLabelsForLevels)
# we have
theMapAndData[['fragility_Qt','fragility_Qt_jc5','fragility_Qt_jc5_cat']].head(20)
We are ready for a choropleth:
import matplotlib.pyplot as plt
f, ax = plt.subplots(1, figsize=(10, 10))
theMapAndData.plot(column='fragility_Qt_jc5_cat', # variable to plot
cmap='viridis', # set of colors
categorical=True, # can be interpreted as category
edgecolor='white', # border color
linewidth=0., # width of border
alpha=1, # level of transparency (0 is invisible)
legend=True, # need a legend?
# location of legend: 'best', 'upper right', 'upper left', 'lower left',
# 'lower right', 'right', 'center left', 'center right',
# 'lower center', 'upper center', 'center'
legend_kwds={'loc':"lower left"},
ax=ax
)
ax.set_axis_off()
However, once you know the ADCM, you can request the choropleth without creating a variable:
import matplotlib.pyplot as plt
f, ax = plt.subplots(1, figsize=(10, 10))
theMapAndData.plot(column='fragility_Qt',
cmap='OrRd',
scheme="jenkscaspall",k=5,
edgecolor='grey',
linewidth=0.5,
alpha=1,
legend=True,
legend_kwds={'loc':3},
ax=ax
)
ax.set_axis_off()
# finally
theMapAndData.to_file(os.path.join("maps","worldMaps.gpkg"), layer='indicators', driver="GPKG")
Exercise 5¶
- Transform the co2 and forest variables.
- Discretize the chosen result.
- Make the maps for the co2 and forest variables.
- Add another variable (merge) from the web (or any other source). Transform it, discretize it, and map it.