
Extracting Insights from Relational Data


Before we begin, make sure you have downloaded the necessary datasets for this section. We will be using our familiar Kloten datasets, plus a new extended dataset from Lugano.

Let us load our primary extended Kloten dataset. We will also extract the month from the date string right away, as we will need it for grouping later.

import pandas as pd

# Load the extended Kloten dataset
fp = "kloten_summer_2022_extended.txt"
data = pd.read_csv(fp, sep=r"\s+", skiprows=[1], na_values=["-9999"])

# Extract the month (string positions 4-5, i.e. the MM in YYYYMMDD)
data["month"] = data["DATE"].astype(str).str.slice(4, 6)

display(data.head(3))

Extended Kloten Dataset (with month)

   DATE      tmin  tmax  tmean  rh    wind_speed  radiation  month
0  20220601  11.1  19.6  14.8   83.1  1.8         126.7      06
1  20220602  12.3  21.8  17.0   81.4  2.3         231.3      06
2  20220603  12.7  24.8  NaN    82.2  2.4         198.1      06

1. Unique Values and Counts

Sometimes you want to know exactly what distinct categories exist within a specific column. Pandas provides two simple methods for this: .unique() to list the distinct values, and .nunique() to count them.

For example, let us confirm exactly which months are represented in our dataset.

# See the actual unique values
print("Unique months:", data["month"].unique())

# Count the number of unique values
print("Number of unique months:", data["month"].nunique())

Counting Category Frequencies

Knowing the unique categories is helpful, but usually we also want to know how many rows belong to each category.

To answer “how many of each?”, Pandas provides the incredibly useful .value_counts() method. It instantly calculates the frequency of every unique value in a column and sorts them from most to least common.

# Count how many days of data we have for each specific month
display(data["month"].value_counts())

These three methods (unique, nunique, and value_counts) form your standard toolkit for exploring categorical data. They come in handy for tasks like checking how many unique weather stations are present in a massive global dataset, or finding out which station recorded the most data points!
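To see all three methods side by side, here is a tiny self-contained sketch using an invented station table (the station codes and values are made up for illustration):

```python
import pandas as pd

# A small, invented table of station observations
obs = pd.DataFrame({
    "station": ["KLO", "LUG", "KLO", "BER", "KLO", "LUG"],
    "tmax": [19.6, 25.1, 21.8, 18.2, 24.8, 26.0],
})

# Which distinct stations appear?
print(obs["station"].unique())

# How many distinct stations are there?
print(obs["station"].nunique())  # 3

# Which station recorded the most data points?
print(obs["station"].value_counts())  # KLO is most frequent, with 3 rows
```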

Concept Check: Which counting tool?

Imagine you are analyzing a dataset of 10,000 recorded bird sightings. You want to know exactly how many different species of birds were observed in total (e.g., the answer should be a single number, like 45).

Which method should you use on the species column?


2. Basic Statistics

The DataFrame and Series objects contain built-in methods for generating rapid summary statistics. Rather than writing complex math, you can simply call .mean(), .median(), .min(), .max(), and .std() (standard deviation).

If you call these on a single column, you get a single number:

# Find the absolute maximum temperature of the entire summer
hottest_temp = data["tmax"].max()
print(f"The highest temperature was {hottest_temp}°C")
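The other statistics work the same way. A minimal self-contained sketch (the values below are invented for illustration, not taken from the Kloten file):

```python
import pandas as pd

temps = pd.Series([18.6, 24.3, 27.2, 29.5, 36.0])

print(temps.mean())    # arithmetic mean
print(temps.median())  # middle value: 27.2
print(temps.min())     # smallest value: 18.6
print(temps.std())     # sample standard deviation (ddof=1 by default)
```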

If you want a comprehensive overview of your entire dataset, the .describe() method is your best friend. It generates a statistical summary for all numeric columns at once, automatically ignoring missing NaN values.

# Generate summary statistics for specific columns
columns_to_check = ["tmax", "tmin", "wind_speed"]
display(data[columns_to_check].describe().round(2))

Output of describe()

        tmax   tmin  wind_speed
count  92.00  92.00       92.00
mean   27.07  13.95        2.53
std     4.01   2.37        0.68
min    18.60   7.30        1.10
25%    24.33  12.70        2.10
50%    27.25  13.60        2.40
75%    29.50  15.38        3.00
max    36.00  19.40        4.20

3. Grouping Data

Calculating the average temperature for the entire summer is interesting, but what if we want to know the average temperature for each specific month?

Pandas uses a powerful concept called Split-Apply-Combine, accessible via the .groupby() method. This allows us to:

  1. Split the dataset into groups based on a category (like our month column).

  2. Apply a mathematical function (like mean or sum) to each group independently.

  3. Combine the results back into a neat summary table.

Let us group our data by month and calculate the average temperatures and wind speed. To understand the mechanics, we will write it out step-by-step first:

# 1. SPLIT: Group the data by the 'month' column
grouped_data = data.groupby("month")

# 2. SELECT: Isolate the columns we want to calculate
columns_to_aggregate = ["tmax", "tmin", "wind_speed"]

# 3. APPLY & COMBINE: Apply the mean() function
monthly_averages = grouped_data[columns_to_aggregate].mean()

display(monthly_averages.round(1))

Output without reset_index()

       tmax  tmin  wind_speed
month
06     25.9  13.1         2.4
07     28.2  14.6         2.6
08     27.1  14.1         2.6

Notice how month is sitting lower than the other column headers? That is because .groupby() automatically turns your grouping category into the DataFrame’s Index!
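Having the grouping category as the index is not a problem; it just changes how you select rows. With a labeled index you can pull out a single group with .loc. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["06", "06", "07", "07"],
    "tmax": [20.0, 22.0, 28.0, 30.0],
})

averages = df.groupby("month")["tmax"].mean()

# 'month' is now the index, so we select by label:
print(averages.loc["07"])  # 29.0
```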

The Chained Approach

Once you understand the Split-Apply-Combine logic, you do not need to write three separate lines of code. Pandas is designed to “chain” these commands together.

We can also add one crucial extra command at the end: .reset_index(). By default, .groupby() turns your grouping category (month) into the index. Resetting the index turns month back into a normal, usable column!

# The elegant, one-line Pandas way:
monthly_averages = (
    data.groupby("month")[["tmax", "tmin", "wind_speed"]].mean().reset_index()
)

# Round the results to 2 decimal places for readability
display(monthly_averages.round(2))

Output of groupby and mean

  month   tmax   tmin  wind_speed
0    06  25.87  13.10        2.40
1    07  28.21  14.60        2.62
2    08  27.09  14.14        2.55

With a single line of chained code, we extracted a high-level insight: July (07) was noticeably hotter and slightly windier on average than June or August!
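If you want more than one statistic per group, the same chained pattern extends with .agg(). A hedged sketch, again using invented values rather than the Kloten data:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["06", "06", "07", "07"],
    "tmax": [20.0, 22.0, 28.0, 30.0],
})

# Several aggregations in one pass: mean, max, and row count per group
summary = df.groupby("month")["tmax"].agg(["mean", "max", "count"]).reset_index()
print(summary)
```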


4. Table Joins

In real-world projects, your data is rarely contained in just one file. You might have one table with daily temperature data and another table with complementary measurements like rainfall. To analyze them together, you must join them.

Pandas uses the .merge() method to combine two DataFrames. It looks for a common column (the “key”) in both tables and aligns the rows where the keys match.

Let us load a new supplementary dataset (kloten_summer_2022_rain_sun.csv) that contains daily precipitation and sunshine_duration, and merge it with our extended temperature dataset. We will use the date as our shared key.

# Load the supplementary dataset
# (Since it is a standard CSV, we don't need sep or skiprows!)
supp_fp = "kloten_summer_2022_rain_sun.csv"
supp_data = pd.read_csv(supp_fp)

# Merge the two DataFrames.
# Since the date columns have different names ('DATE' vs 'YEARMODA'),
# we use left_on and right_on to tell Pandas which keys to match!
combined_data = pd.merge(data, supp_data, left_on="DATE", right_on="YEARMODA")

# Remove duplicate key column
combined_data = combined_data.drop(columns=["YEARMODA"])

# Display a few columns from the newly merged massive dataset
columns_to_show = ["DATE", "tmax", "rh", "precipitation", "sunshine_duration"]
display(combined_data[columns_to_show].head(3))

Output of merged DataFrames

   DATE      tmax  rh    precipitation  sunshine_duration
0  20220601  19.6  83.1            2.2                 74
1  20220602  21.8  81.4            0.0                281
2  20220603  24.8  82.2            0.3                333

Because both tables shared the exact same dates, Pandas neatly attached the rain and sun columns from the new table right next to the corresponding temperatures in our main table!
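When the key column has the same name in both tables, you can pass a single on= argument instead of left_on and right_on. A minimal sketch with two invented tables:

```python
import pandas as pd

temps = pd.DataFrame({"DATE": [20220601, 20220602], "tmax": [19.6, 21.8]})
rain = pd.DataFrame({"DATE": [20220601, 20220602], "precipitation": [2.2, 0.0]})

# Both tables name the key 'DATE', so one on= argument is enough
combined = pd.merge(temps, rain, on="DATE")
print(combined)  # one row per matching DATE, with columns from both tables
```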

Concept Check: The Shared Key

You have Table A (containing city_name and population) and Table B (containing mayor_name and city_name).

If you want to merge these tables so you can see the population and the mayor side-by-side, what column name should you pass to the on= parameter in pd.merge()?


5. Exercise

It is time to put your relational data skills to the test. You have been asked to compare the summer weather of Kloten (near Zurich) with Lugano (in southern Switzerland).

Tasks:

  1. Load the Lugano_summer_2022_extended.txt file into a new DataFrame. Remember to use the same sep, skiprows, and na_values parameters.

  2. Use pd.merge() to combine your newly merged Kloten combined_data DataFrame with the Lugano DataFrame using the DATE column as the key. (Hint: Because both tables have columns named tmean and precipitation, Pandas will automatically add _x and _y to the names to tell them apart. You can control this using the suffixes parameter in your merge function, like suffixes=("_kloten", "_lugano"))

  3. Group the newly merged DataFrame by month. (If your merged DataFrame lost the month column, recreate it by slicing the DATE string!)

  4. Calculate the .mean() for tmean and precipitation for both cities.

  5. Display the final aggregated table (round your results to 1 decimal place). Which city was hotter on average in July?

# Write your code here

6. Summary

You have officially transitioned from just manipulating rows of data to generating actionable knowledge and insights.

Key takeaways

  • Unique Values: Use .unique(), .nunique(), and .value_counts() to quickly understand the distinct categories and their frequencies in your dataset.

  • Summary Statistics: Methods like .mean(), .max(), and .describe() provide an instant mathematical overview of your columns.

  • Grouping: The .groupby() method is essential for aggregating data. It splits your data by a category (like a specific month), applies a calculation, and builds a neat summary table.

  • Joining Tables: The pd.merge() function allows you to stitch separate datasets together horizontally by matching rows based on a shared key column (like a specific date).

What comes next?

Throughout this chapter, we relied on a bit of a hack: we used string slicing (.str.slice()) to chop up text and extract the month from our dates. But what if we wanted to calculate the exact number of days between two measurements? Or what if our dates were formatted messily like “Jan 1st, 2022” instead of “20220101”?

Dates are not just strings of text; they are measurable, plottable dimensions.

In the next section, Working with Temporal Data, we will learn how to convert clunky text into intelligent Pandas datetime objects. You will learn how to effortlessly extract years, months, and days using the special .dt accessor, and how to automatically resample time-series data (like converting hourly sensor readings into daily averages) without writing complex code.
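As a tiny preview (just a sketch; the next section covers this properly), converting a date string with pd.to_datetime makes its components directly accessible, with no string slicing:

```python
import pandas as pd

dates = pd.Series(["20220601", "20220715"])
parsed = pd.to_datetime(dates, format="%Y%m%d")

print(parsed.dt.month)  # the months 6 and 7, extracted via the .dt accessor
```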