Data Classification - Programming with Spatial Data

Categorizing Spatial Data

In the previous sections, we learned how to generate new spatial datasets, engineer new columns, and combine geometries. You now have a GeoDataFrame packed with precise, raw numbers.

However, if you try to plot these raw numbers directly, the resulting map can be confusing. This section marks the transition from spatial analysis to spatial communication. You will learn why we must normalize our data and how to use Python rules and statistical algorithms to categorize (classify) your data into readable groups.

1. Why Classify Data?¶

When visualizing data within geographic boundaries (like cantons or municipalities), we create what is called a Choropleth Map or a thematic map.

Before we assign colors to our polygons, we must overcome two fundamental problems: The Area Bias and The Continuous Brain Limit.

The Area Bias (Why we need Ratios)¶

When creating a choropleth map (where entire polygons are filled with color), you must be very careful about mapping raw counts, such as the total population or the total number of crimes.

Raw counts are spatially extensive variables, meaning their values are directly correlated with the physical size of the observational unit. Everything else being equal, a larger municipality will naturally have more houses, more people, and more total crimes simply because it has more space. If you map total population, the largest polygons will almost always appear the darkest, misleading the viewer into thinking they are the most crowded or dangerous areas.

To fix this area bias, we must convert our counts into spatially intensive variables by creating a ratio or density. By dividing the raw count by a measure of size (like total area or total population at risk), we reveal the intrinsic variation of the data, regardless of how large or small the municipal borders happen to be.

Let us load the Swiss municipalities and prove this visually by comparing the extensive total population to the intensive population density.

import geopandas as gpd
import matplotlib.pyplot as plt

# Load municipalities and ensure metric CRS
muni_gdf = gpd.read_file("swissBoundaries3D_municipalities.gpkg").to_crs(epsg=2056)

# Calculate Population Density (People per hectare)
# gem_flaeche is the municipal area in hectares
muni_gdf["pop_density"] = muni_gdf["einwohnerzahl"] / muni_gdf["gem_flaeche"]

# Create a 1x3 subplot to compare the raw data vs the ratio
# Increased figsize slightly to give the horizontal legends room to breathe
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))

# Plot 1: Raw Population
muni_gdf.plot(
    column="einwohnerzahl",
    ax=ax1,
    cmap="Blues",
    legend=True,
    legend_kwds={"orientation": "horizontal", "label": "Total Population"},
)
ax1.set_title("1. Population")
ax1.axis("off")

# Plot 2: Raw Area
muni_gdf.plot(
    column="gem_flaeche",
    ax=ax2,
    cmap="Greens",
    legend=True,
    legend_kwds={"orientation": "horizontal", "label": "Area (Hectares)"},
)
ax2.set_title("2. Area")
ax2.axis("off")

# Plot 3: Population Density (The correct way)
muni_gdf.plot(
    column="pop_density",
    ax=ax3,
    cmap="Reds",
    legend=True,
    legend_kwds={"orientation": "horizontal", "label": "People per Hectare"},
)
ax3.set_title("3. Population Density")
ax3.axis("off")

plt.tight_layout()
plt.show()

Click through the tabs below to see how plotting a spatially extensive variable creates a visual illusion, and how a spatially intensive ratio fixes it.

1. Raw Population

2. Raw Area

3. Population Density

A map of Switzerland showing raw population in blue. Many physically large municipalities appear in medium to dark blue. — Raw population is a spatially extensive variable. Notice how many of the largest municipalities appear dark blue simply because they have more physical space, incorrectly suggesting they are the most crowded areas.

Concept Check: The CO2 Map¶

Scenario: You have been asked to create a map showing the carbon footprint of European countries. You have two columns of data: total_co2_emissions (Total tons of CO2) and co2_per_capita (Tons of CO2 per person).

If your goal is to show which country’s citizens have the least sustainable lifestyle, which column should you map, and why?

A) total_co2_emissions, because it is a spatially intensive variable.
B) total_co2_emissions, because it is a spatially extensive variable.
C) co2_per_capita, because it is a spatially intensive variable.

Check your understanding

Answer: C total_co2_emissions is a spatially extensive raw count. If you map it, a massive country with a huge population like Germany will naturally look like the worst offender simply because it is larger and has more people. By mapping co2_per_capita (a ratio), you create a spatially intensive variable that reveals the true intrinsic lifestyle footprint of the average citizen, regardless of the country’s total size.

The Continuous Brain Limit (Why we need Bins)¶

Even after fixing the area bias, we have another problem. Our pop_density column contains thousands of unique, continuous decimal values. If we pass this continuous data to a map, the computer will generate thousands of slightly different shades of red.

The human eye cannot distinguish between 100 different shades of red. To make the map more readable, we can simplify our thousands of unique values into a small set of five logical categories (e.g., “Very Low”, “Low”, “Medium”, “High”, “Very High”).

2. Rule-based Classification¶

The most straightforward way to classify data is to write your own custom logic. This is highly effective when you have specific, scientifically meaningful thresholds (e.g., an environmental safety limit, or legal zoning density rules).

We can write a simple Python function to evaluate each row and assign a text label based on the population density. We then use the Pandas .apply() method to run this function across the entire dataset.

Once the data is classified into text labels (categories), there is a common trap beginners fall into: if you just plot the text strings, Python will automatically sort your legend alphabetically (High, Low, Medium, Very High, Very Low), which completely breaks the visual logic of a color ramp!

To fix this, we can quickly convert the new column into a Pandas Categorical data type to enforce a strict, logical order before plotting.

import pandas as pd
import matplotlib.pyplot as plt


# 1. Define a custom classification function
def classify_density(density_value):
    if density_value < 4:
        return "Very Low"
    elif 4 <= density_value < 12:
        return "Low"
    elif 12 <= density_value < 24:
        return "Medium"
    elif 24 <= density_value < 49:
        return "High"
    else:
        return "Very High"


# 2. Apply the function to create a new classified column
muni_gdf["density_class"] = muni_gdf["pop_density"].apply(classify_density)

# 3. Enforce a logical order for our categories (prevents alphabetical sorting in the legend)
category_order = ["Very Low", "Low", "Medium", "High", "Very High"]
muni_gdf["density_class"] = pd.Categorical(
    muni_gdf["density_class"], categories=category_order, ordered=True
)

# 4. Plot the beautifully ordered categorical map
fig, ax = plt.subplots(figsize=(12, 8))

muni_gdf.plot(
    column="density_class",
    categorical=True,  # Tell GeoPandas these are distinct groups, not a continuous scale
    cmap="YlOrRd",  # Yellow to Orange to Red color map
    linewidth=0.1,
    edgecolor="darkgrey",
    legend=True,
    legend_kwds={"title": "Density Class", "loc": "upper left"},
    ax=ax,
)

ax.set_title("Rule-Based Classification of Population Density")
ax.axis("off")
plt.show()

# Let's peek at the actual data table too!
display(muni_gdf[["name", "pop_density", "density_class"]].head(3))

A map of Switzerland showing population density classified into five categories from 'Very Low' to 'Very High'. The map uses a yellow-to-red color ramp, with yellow representing very low density and dark red representing very high density. The legend is correctly ordered from lowest to highest. — *Applying custom rules. Now, instead of thousands of confusing continuous numbers, our map only has to display 5 distinct, easily understandable categories!*

3. Classification Schemes¶

Writing custom rules is great, but what if you do not know the data well enough to guess the best threshold values?

Spatial data science relies on statistical grouping algorithms to automatically slice data into logical bins. In Python, GeoPandas integrates directly with a library called mapclassify to handle this seamlessly behind the scenes.

Here are the four most common statistical classification schemes:

Equal Intervals (equal_interval): Divides the data range into equal-sized sub-ranges (e.g., 0-25, 25-50, 50-75). Warning: Highly sensitive to extreme outliers; most of your data might end up stuffed into a single bin.
Quantiles (quantiles): Divides the data so that every class contains the exact same number of polygons (e.g., the top 20% of municipalities, the next 20%, etc.). Excellent for showing relative rankings, but visually distorts the true numerical gaps between values.
Natural Breaks / Fisher-Jenks (natural_breaks): A smart algorithm that looks for “valleys” in the data distribution histogram. It minimizes variance within classes and maximizes variance between classes. This is often the best default choice for geographic data.
Pretty Breaks (prettybreaks): Attempts to round the class break values to numbers that look “nice” to humans (like 10, 20, 30 instead of 11.4, 19.8, 32.1).

You do not even need to write the complex mathematical code for these. GeoPandas allows you to pass the scheme directly into the .plot() command, along with k to define the number of bins!

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

# Plotting with a statistical classification scheme
muni_gdf.plot(
    column="pop_density",
    scheme="quantiles",  # Change this string to test different algorithms!
    k=5,  # Number of bins
    cmap="Reds",
    legend=True,
    legend_kwds={"title": "Pop Density (Quantiles)", "loc": "upper left"},
    linewidth=0.1,
    edgecolor="darkgrey",
    ax=ax,
)

ax.set_title("Statistical Classification of Population Density")
ax.axis("off")
plt.show()

Let us compare how these four different algorithms slice the exact same pop_density data into 5 classes. Click through the tabs below and pay close attention to the numbers in the legends!

Equal Intervals

Quantiles

Natural Breaks

Pretty Breaks

A map of Switzerland showing population density using equal intervals. Almost the entire map is the lightest shade of red, with only a tiny handful of dots appearing in darker red. — Equal Intervals divides the math, not the data. Because cities like Geneva have a massive density (over 120 people/hectare), the intervals are split into chunks of ~25. Since 99% of Swiss municipalities have a density lower than 25, almost the entire map ends up stuffed into the very first bin, making it mostly unreadable.

Visualizing the Class Breaks¶

To truly understand what the computer just did, examination of the data distribution is highly beneficial to see exactly where the mathematical rules chopped the numbers.

In the tabs below, we’ve plotted a simplified histogram of our population density values, with vertical dashed lines indicating the new class break points.

Equal Interval Breaks

Quantile Breaks

Natural Breaks

User Defined (Pretty) Breaks

A histogram showing population density counts for Swiss municipalities. Five vertical dashed black lines are placed at perfectly evenly spaced intervals across the numerical range of the population density values, mostly out in the empty tail of the distribution. — Equal intervals produce class break lines that are perfectly evenly spaced across the numerical range of the population density values. Notice how this fails to capture the massive cluster of rural municipalities near zero, all of which get clumped into the very first category.

Concept Check: The Top 20%¶

Scenario: A real estate company asks you to create a map of municipal housing prices. They want the legend to specifically highlight the “Top 20% most expensive municipalities” in dark red, the next 20% in medium red, and so on.

Which classification scheme must you use to guarantee that exactly 20% of the map’s polygons fall into each category?

A) Natural Breaks (natural_breaks)
B) Quantiles (quantiles)
C) Equal Intervals (equal_interval)

Check your understanding

Answer: B Quantiles are specifically designed to distribute an equal number of observations (polygons) into each class. If you set k=5, the algorithm will perfectly sort the data and assign exactly 20% of the municipalities into each of the 5 bins, perfectly matching the real estate company’s request.

4. Exercise: Water Accessibility¶

It is time to practice engineering a ratio and classifying it statistically. We want to understand water accessibility across Switzerland by calculating how many people share each hectare of lake area in a given municipality. (Think of this as a rough proxy for how crowded the local beaches might be!)

However, geography is messy. Some municipalities have lakes but zero residents. Many municipalities have thousands of residents but zero lakes! You will need to handle these missing data points cleanly before classifying the results.

Tasks:

Load Data: If you have not already, load swissBoundaries3D_municipalities.gpkg and ensure it is in the metric EPSG:2056 projection.
Handle the Zeros: If a municipality has 0 lake area (see_flaeche == 0), dividing by it will cause a mathematical error (Infinity). Replace all 0 values in the see_flaeche column with np.nan (Not a Number) to safely ignore them.
Calculate the Ratio: Create a new column called water_ratio by dividing the population (einwohnerzahl) by the newly cleaned lake area (see_flaeche).
Plot and Classify: Plot the water_ratio column. Use the quantiles scheme with k=5 classes. Use the missing_kwds argument to color all the municipalities that had no lakes in lightgrey.

# Write your code here

Sample solution

import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt

# 1. Ensure data is loaded and projected
muni_gdf = gpd.read_file("swissBoundaries3D_municipalities.gpkg").to_crs(epsg=2056)

# 2. Handle the Zeros to prevent division by zero errors
# We replace 0 with np.nan so Pandas knows it is "missing" data
muni_gdf["see_flaeche"] = muni_gdf["see_flaeche"].replace(0, np.nan)

# 3. Calculate the Ratio (Population per hectare of lake)
muni_gdf["water_ratio"] = muni_gdf["einwohnerzahl"] / muni_gdf["see_flaeche"]

# 4. Plot, Classify, and style the missing data
ax = muni_gdf.plot(
    figsize=(12, 8),
    column="water_ratio", 
    scheme="quantiles", 
    k=5, 
    cmap="Blues", 
    legend=True,
    legend_kwds={"loc": "upper left", "title": "Pop per Hectare of Lake"},
    missing_kwds={"color": "lightgrey", "label": "No Lake Area"},
    edgecolor="white",
    linewidth=0.1
)

ax.set_title("Water Accessibility: Municipalities with Lakes")
ax.axis("off")
plt.show()

A map of Switzerland where the vast majority of municipalities are colored light grey. The remaining municipalities, which contain lakes, are colored in five shades of blue. The darkest blue areas indicate high populations sharing small lake areas. — Visualizing the classified ratio. By explicitly replacing zeros with missing data (`np.nan`) and utilizing `missing_kwds`, we successfully separated the municipalities without lakes (light grey) from those we actually wanted to analyze and classify (blue).

5. Summary: Prepping for Visualization¶

In this section, you bridged the critical gap between raw spatial analysis and human readability. By transforming your numbers into organized classes, you have prepared your data for the final stage of geographic information systems: map making.

Key takeaways¶

Always Normalize: Never map raw counts (like total population) on polygon maps, as it introduces severe area bias. Always convert counts into a ratio or density first.
The Necessity of Bins: Human eyes cannot interpret continuous color ramps accurately. Grouping data into distinct classes makes patterns immediately readable.
Rule-Based: Use custom Python logic (like if/elif statements combined with .apply()) when you have specific, meaningful thresholds you want to highlight.
Statistical Schemes: Use mapclassify built directly into GeoPandas plotting (e.g., scheme="natural_breaks") to let algorithms find the best way to slice your data. Choose Quantiles for even color distribution, and Natural Breaks for mathematically accurate clustering.

What comes next?¶

You now know how to manage CRSs, manipulate geometries, join distinct layers together, and classify the resulting data into logical bins.

You have all the ingredients required to build professional maps. In the next lesson on Data Visualization, we will formally introduce the visualization libraries that will allow you to add titles, basemaps, scale bars, and interactive legends to turn your Python code into presentation-ready cartography!

But first, it is time to put the new concepts of this lesson to the test in the practical session.