
Extracting Insights from Relational Data


Before we begin, make sure you have downloaded the necessary datasets for this section. We will be using our familiar Kloten datasets, plus a new extended dataset from Lugano.

Let us load our primary extended Kloten dataset. We will also extract the month from the date string right away, as we will need it for grouping later.

import pandas as pd

# Load the extended Kloten dataset
fp = "kloten_summer_2022_extended.txt"
data = pd.read_csv(fp, sep=r"\s+", skiprows=[1], na_values=["-9999"])

# Extract the month (string positions 4-5, i.e. the MM in YYYYMMDD)
data["month"] = data["DATE"].astype(str).str.slice(4, 6)

display(data.head(3))

Extended Kloten Dataset (with month)

   DATE      tmin  tmax  tmean  rh    wind_speed  radiation  month
0  20220601  11.1  19.6  14.8   83.1  1.8         126.7      06
1  20220602  12.3  21.8  17.0   81.4  2.3         231.3      06
2  20220603  12.7  24.8  NaN    82.2  2.4         198.1      06

1. Unique Values and Counts

Sometimes you want to know exactly what distinct categories exist within a specific column. Pandas provides two simple methods for this: .unique() to list the distinct values, and .nunique() to count them.

For example, let us confirm exactly which months are represented in our dataset.

# See the actual unique values
print("Unique months:", data["month"].unique())

# Count the number of unique values
print("Number of unique months:", data["month"].nunique())

Counting Category Frequencies

Knowing the unique categories is helpful, but usually we also want to know how many rows belong to each category.

To answer “how many of each?”, Pandas provides the incredibly useful .value_counts() method. It instantly calculates the frequency of every unique value in a column and sorts them from most to least common.

# Count how many days of data we have for each specific month
display(data["month"].value_counts())

These three methods (unique, nunique, and value_counts) form your standard toolkit for exploring categorical data. They come in handy for tasks like checking how many unique weather stations are present in a massive global dataset, or finding out which station recorded the most data points!
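To see all three methods side by side, here is a tiny self-contained sketch using an invented station table (the station codes and values are made up for illustration):

```python
import pandas as pd

# A small, invented table of station observations
obs = pd.DataFrame({
    "station": ["KLO", "LUG", "KLO", "BER", "KLO", "LUG"],
    "tmax": [19.6, 25.1, 21.8, 18.2, 24.8, 26.0],
})

# Which distinct stations appear?
print(obs["station"].unique())

# How many distinct stations are there?
print(obs["station"].nunique())  # 3

# Which station recorded the most data points?
print(obs["station"].value_counts())  # KLO is most frequent, with 3 rows
```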

Concept Check: Which counting tool?

Imagine you are analyzing a dataset of 10,000 recorded bird sightings. You want to know exactly how many different species of birds were observed in total (e.g., the answer should be a single number, like 45).

Which method should you use on the species column?


2. Basic Statistics

The DataFrame and Series objects contain built-in methods for generating rapid summary statistics. Rather than writing complex math, you can simply call .mean(), .median(), .min(), .max(), and .std() (standard deviation).

If you call these on a single column, you get a single number:

# Find the absolute maximum temperature of the entire summer
hottest_temp = data["tmax"].max()
print(f"The highest temperature was {hottest_temp}°C")
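The other statistics work the same way. A minimal self-contained sketch (the values below are invented for illustration, not taken from the Kloten file):

```python
import pandas as pd

temps = pd.Series([18.6, 24.3, 27.2, 29.5, 36.0])

print(temps.mean())    # arithmetic mean
print(temps.median())  # middle value: 27.2
print(temps.min())     # smallest value: 18.6
print(temps.std())     # sample standard deviation (ddof=1 by default)
```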

If you want a comprehensive overview of your entire dataset, the .describe() method is your best friend. It generates a statistical summary for all numeric columns at once, automatically ignoring missing NaN values.

# Generate summary statistics for specific columns
columns_to_check = ["tmax", "tmin", "wind_speed"]
display(data[columns_to_check].describe().round(2))

Output of describe()

        tmax   tmin  wind_speed
count  92.00  92.00       92.00
mean   27.07  13.95        2.53
std     4.01   2.37        0.68
min    18.60   7.30        1.10
25%    24.33  12.70        2.10
50%    27.25  13.60        2.40
75%    29.50  15.38        3.00
max    36.00  19.40        4.20

3. Grouping Data

Calculating the average temperature for the entire summer is interesting, but what if we want to know the average temperature for each specific month?

Pandas uses a powerful concept called Split-Apply-Combine, accessible via the .groupby() method. This allows us to:

  1. Split the dataset into groups based on a category (like our month column).

  2. Apply a mathematical function (like mean or sum) to each group independently.

  3. Combine the results back into a neat summary table.

Let us group our data by month and calculate the average temperatures and wind speed. To understand the mechanics, we will write it out step-by-step first:

# 1. SPLIT: Group the data by the 'month' column
grouped_data = data.groupby("month")

# 2. SELECT: Isolate the columns we want to calculate
columns_to_aggregate = ["tmax", "tmin", "wind_speed"]

# 3. APPLY & COMBINE: Apply the mean() function
monthly_averages = grouped_data[columns_to_aggregate].mean()

display(monthly_averages.round(1))

Output without reset_index()

       tmax  tmin  wind_speed
month
06     25.9  13.1         2.4
07     28.2  14.6         2.6
08     27.1  14.1         2.6

Notice how month is sitting lower than the other column headers? That is because .groupby() automatically turns your grouping category into the DataFrame’s Index!
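Having the grouping category as the index is not a problem; it just changes how you select rows. With a labeled index you can pull out a single group with .loc. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["06", "06", "07", "07"],
    "tmax": [20.0, 22.0, 28.0, 30.0],
})

averages = df.groupby("month")["tmax"].mean()

# 'month' is now the index, so we select by label:
print(averages.loc["07"])  # 29.0
```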

The Chained Approach

Once you understand the Split-Apply-Combine logic, you do not need to write three separate lines of code. Pandas is designed to “chain” these commands together.

We can also add one crucial extra command at the end: .reset_index(). By default, .groupby() turns your grouping category (month) into the index. Resetting the index turns month back into a normal, usable column!

# The elegant, one-line Pandas way:
monthly_averages = (
    data.groupby("month")[["tmax", "tmin", "wind_speed"]].mean().reset_index()
)

# Round the results to 2 decimal places for readability
display(monthly_averages.round(2))

Output of groupby and mean

  month   tmax   tmin  wind_speed
0    06  25.87  13.10        2.40
1    07  28.21  14.60        2.62
2    08  27.09  14.14        2.55

With a single line of chained code, we extracted a high-level insight: July (07) was noticeably hotter and slightly windier on average than June or August!
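If you want more than one statistic per group, the same chained pattern extends with .agg(). A hedged sketch, again using invented values rather than the Kloten data:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["06", "06", "07", "07"],
    "tmax": [20.0, 22.0, 28.0, 30.0],
})

# Several aggregations in one pass: mean, max, and row count per group
summary = df.groupby("month")["tmax"].agg(["mean", "max", "count"]).reset_index()
print(summary)
```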


4. Table Joins

In real-world projects, your data is rarely contained in just one file. You might have one table with daily temperature data and another table with complementary measurements like rainfall. To analyze them together, you must join them.

Pandas uses the .merge() method to combine two DataFrames. It looks for a common column (the “key”) in both tables and aligns the rows where the keys match.

Let us load a new supplementary dataset (kloten_summer_2022_rain_sun.csv) that contains daily precipitation and sunshine_duration, and merge it with our extended temperature dataset. We will use the date as our shared key.

# Load the supplementary dataset
# (Since it is a standard CSV, we don't need sep or skiprows!)
supp_fp = "kloten_summer_2022_rain_sun.csv"
supp_data = pd.read_csv(supp_fp)

# Merge the two DataFrames.
# Since the date columns have different names ('DATE' vs 'YEARMODA'),
# we use left_on and right_on to tell Pandas which keys to match!
combined_data = pd.merge(data, supp_data, left_on="DATE", right_on="YEARMODA")

# Remove duplicate key column
combined_data = combined_data.drop(columns=["YEARMODA"])

# Display a few columns from the newly merged massive dataset
columns_to_show = ["DATE", "tmax", "rh", "precipitation", "sunshine_duration"]
display(combined_data[columns_to_show].head(3))

Output of merged DataFrames

   DATE      tmax  rh    precipitation  sunshine_duration
0  20220601  19.6  83.1            2.2                 74
1  20220602  21.8  81.4            0.0                281
2  20220603  24.8  82.2            0.3                333

Because both tables shared the exact same dates, Pandas neatly attached the rain and sun columns from the new table right next to the corresponding temperatures in our main table!
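When the key column has the same name in both tables, you can pass a single on= argument instead of left_on and right_on. A minimal sketch with two invented tables:

```python
import pandas as pd

temps = pd.DataFrame({"DATE": [20220601, 20220602], "tmax": [19.6, 21.8]})
rain = pd.DataFrame({"DATE": [20220601, 20220602], "precipitation": [2.2, 0.0]})

# Both tables name the key 'DATE', so one on= argument is enough
combined = pd.merge(temps, rain, on="DATE")
print(combined)  # one row per matching DATE, with columns from both tables
```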

Concept Check: The Shared Key

You have Table A (containing city_name and population) and Table B (containing mayor_name and city_name).

If you want to merge these tables so you can see the population and the mayor side-by-side, what column name should you pass to the on= parameter in pd.merge()?


5. Exercise

It is time to put your relational data skills to the test. You have been asked to compare the summer weather of Kloten (near Zurich) with Lugano (in southern Switzerland).

Tasks:

  1. Load the Lugano_summer_2022_extended.txt file into a new DataFrame. Remember to use the same sep, skiprows, and na_values parameters.

  2. Use pd.merge() to combine your newly merged Kloten combined_data DataFrame with the Lugano DataFrame using the DATE column as the key. (Hint: Because both tables have columns named tmean and precipitation, Pandas will automatically add _x and _y to the names to tell them apart. You can control this using the suffixes parameter in your merge function, like suffixes=("_kloten", "_lugano"))

  3. Group the newly merged DataFrame by month. (If your merged DataFrame lost the month column, recreate it by slicing the DATE string!)

  4. Calculate the .mean() for tmean and precipitation for both cities.

  5. Display the final aggregated table (round your results to 1 decimal place). Which city was hotter on average in July?

# Write your code here

6. Summary

You have officially transitioned from just manipulating rows of data to generating actionable knowledge and insights.

Key takeaways

  • Unique Values: Use .unique(), .nunique(), and .value_counts() to quickly understand the distinct categories and their frequencies in your dataset.

  • Summary Statistics: Methods like .mean(), .max(), and .describe() provide an instant mathematical overview of your columns.

  • Grouping: The .groupby() method is essential for aggregating data. It splits your data by a category (like a specific month), applies a calculation, and builds a neat summary table.

  • Joining Tables: The pd.merge() function allows you to stitch separate datasets together horizontally by matching rows based on a shared key column (like a specific date).

What comes next?

Throughout this chapter, we relied on a bit of a hack: we used string slicing (.str.slice()) to chop up text and extract the month from our dates. But what if we wanted to calculate the exact number of days between two measurements? Or what if our dates were formatted messily like “Jan 1st, 2022” instead of “20220101”?

Dates are not just strings of text; they are measurable, plottable dimensions.

In the next section, Working with Temporal Data, we will learn how to convert clunky text into intelligent Pandas datetime objects. You will learn how to effortlessly extract years, months, and days using the special .dt accessor, and how to automatically resample time-series data (like converting hourly sensor readings into daily averages) without writing complex code.
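As a tiny preview (just a sketch; the next section covers this properly), converting a date string with pd.to_datetime makes its components directly accessible, with no string slicing:

```python
import pandas as pd

dates = pd.Series(["20220601", "20220715"])
parsed = pd.to_datetime(dates, format="%Y%m%d")

print(parsed.dt.month)  # the months 6 and 7, extracted via the .dt accessor
```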