---
title: "Interactive Demographic Explorer: South Sudan 2008 Census"
subtitle: "A Beginner's Guide to Data Analysis with Pandas Method Chaining and Plotly"
author: "Alierwai Reng"
date: "2024-11-20"
categories: [Python, Pandas, Interactive Visualization, Demographics, Census]
image: featured.png
description: "Learn data analysis step-by-step! This beginner-friendly tutorial explores South Sudan's demographics using pandas method chaining for clean, readable code and Plotly for interactive visualizations."
format:
  html:
    code-fold: true
    code-tools: true
    toc: true
    toc-depth: 3
    toc-title: "Tutorial Contents"
    self-contained: true
---
## Introduction
Welcome to this hands-on data analysis tutorial! By the end of this guide, you'll understand how to:
- **Load and explore** real-world census data
- **Clean data** using pandas method chaining (a professional, readable approach)
- **Calculate** summary statistics and group-level metrics
- **Create interactive visualizations** that invite exploration
We'll analyze South Sudan's 2008 census data, but the techniques you learn apply to any dataset.
::: {.callout-tip}
## What is Method Chaining?
Method chaining connects multiple operations in a single, flowing statement—like a recipe where each step feeds into the next. Instead of creating many intermediate variables, we write clean, readable pipelines.
:::
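To make this concrete, here is a minimal sketch with made-up data: the same clean-rename-dedupe steps written as one chained pipeline.

```python
import pandas as pd

# Toy data: a messy column name, untidy text, and a duplicated row
raw = pd.DataFrame({"Name ": [" juba ", " wau ", " juba "], "Pop": [300, 100, 300]})

tidy = (
    raw
    .rename(columns=lambda c: c.strip().lower())                # "Name " -> "name", "Pop" -> "pop"
    .drop_duplicates()                                          # drop the repeated juba row
    .assign(name=lambda d: d["name"].str.strip().str.title())   # " juba " -> "Juba"
    .reset_index(drop=True)
)
print(tidy)
```

Each step receives the output of the previous one, so no intermediate variables are needed.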
---
## Part 1: Environment Setup
### Step 1: Import Required Libraries
Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.
```{python}
#| label: load-packages
# Data manipulation
import pandas as pd # The core library for data analysis
import janitor # Adds helper functions like clean_names()
# Visualization
import plotly.express as px # Quick, beautiful charts
import plotly.graph_objects as go # Advanced chart customization
from plotly.subplots import make_subplots # Multi-panel dashboards
# Tables
from great_tables import GT # Professional table formatting
print("✅ All libraries loaded successfully!")
print(f"1) 📊 Pandas version: {pd.__version__};\n2) janitor version: {janitor.__version__}")
```
::: {.callout-note}
## Library Purposes
| Library | What It Does |
|---------|--------------|
| **pandas** | Load, clean, transform, and analyze data |
| **janitor** | Simplifies common cleaning tasks (like fixing column names) |
| **plotly** | Creates interactive, web-ready charts |
| **great_tables** | Makes publication-quality tables |
:::
### Step 2: Configure a Custom Theme
Consistent styling makes your visualizations look professional. We'll create a PyStatR+ branded theme.
```{python}
import plotly.io as pio
# Define custom color palette and fonts
pio.templates["pystatr"] = go.layout.Template(
layout=go.Layout(
font=dict(family="Poppins, sans-serif", size=12),
plot_bgcolor="#ffffff",
paper_bgcolor="#ffffff",
colorway=["#22d3ee", "#FFD700", "#06b6d4", "#0891b2", "#0e7490"],
title_font=dict(size=18, color="#06b6d4", family="Poppins"),
)
)
# Set as default for all charts
pio.templates.default = "pystatr"
print("🎨 Custom theme configured!")
```
---
## Part 2: Loading the Data
### Step 3: Load Census Data from URL
Real datasets often live on the web. Pandas can read directly from URLs—no download required!
```{python}
# Define the data source URL
# Storing the URL in a well-named variable makes it easy to find and update later
data_url = (
"https://raw.githubusercontent.com/tongakuot/r_tutorials/"
"main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"
)
# Load the data into a DataFrame
# We name it 'df_raw' to indicate this is the original, uncleaned data
df_raw = pd.read_csv(data_url)
# Confirm successful load with basic info
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
```
::: {.callout-tip}
## Naming Convention: snake_case
We use `snake_case` (lowercase words separated by underscores) for all variable names. This is the Python standard and makes code more readable.
- ✅ Good: `df_raw`, `gender_distribution`, `top_10_states`
- ❌ Avoid: `dfRaw`, `GenderDistribution`, `Top10States`
:::
---
## Part 3: Data Exploration
Before cleaning, we must **understand** our data. What columns exist? What types are they? Are there issues?
### Step 4: Examine Column Names
```{python}
# View the original column names
print("📝 Original column names:")
print(df_raw.columns.tolist())
```
Notice any issues? Inconsistent capitalization, spaces, or special characters make columns harder to work with.
### Step 5: Check Data Types
```{python}
# Check what type of data each column contains
print("🔍 Data types:")
print(df_raw.dtypes)
```
::: {.callout-note}
## Understanding Data Types
| Type | Meaning | Examples |
|------|---------|----------|
| `int64` | Whole numbers | 1, 42, 1000 |
| `float64` | Decimal numbers | 3.14, 99.9 |
| `object` | Text (strings) | "Male", "Juba" |
:::
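You can see these types in action with a tiny made-up frame:

```python
import pandas as pd

# One column of each common type
demo = pd.DataFrame({
    "count": [1, 42],          # whole numbers -> int64
    "rate": [3.14, 99.9],      # decimals -> float64
    "city": ["Juba", "Wau"],   # text -> object (plain Python strings)
})
print(demo.dtypes)
```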
### Step 6: Preview the Data
```{python}
# View the first 5 rows
# This gives us a concrete look at actual values
df_raw.head()
```
### Step 7: Check for Missing Values
```{python}
# Count missing values in each column
missing_counts = df_raw.isna().sum()
print("❓ Missing values per column:")
print(missing_counts)
print(f"\n📊 Total missing values: {missing_counts.sum()}")
```
---
## Part 4: Data Cleaning with Method Chaining
Now we transform our raw data into analysis-ready format. **Method chaining** lets us do this elegantly.
### Step 8: Clean Column Names
The `clean_names()` function from pyjanitor standardizes all column names to `snake_case`. Here we get the same result with a plain `rename()` and a lambda, keeping `clean_names()` as a commented-out alternative.
```{python}
# Clean column names using method chaining
# The dot (.) connects operations into a pipeline
df = (
df_raw
# .clean_names() # Converts to snake_case, removes special characters
.rename(columns=lambda c: str(c).lower().replace(" ", "_"))
)
print("✅ Column names cleaned!")
print("📝 New column names:")
print(df.columns.tolist())
```
::: {.callout-tip}
## Why Method Chaining?
Compare these approaches:
**Without chaining (harder to read):**
```python
df = df_raw.copy()
df = df.clean_names()
df = df.drop_duplicates()
df = df.reset_index(drop=True)
```
**With chaining (clean and readable):**
```python
df = (
df_raw
.rename(columns=lambda c: str(c).lower().replace(" ", "_"))
.drop_duplicates()
.reset_index(drop=True)
)
```
The chained version reads like a recipe—each step flows naturally to the next.
:::
### Step 9: Identify String Columns
Before cleaning text values, we need to know which columns contain text.
```{python}
# Find columns with string (object) data type
string_columns = df.select_dtypes(include=['object']).columns.tolist()
print(f"📝 String columns to clean: {string_columns}")
```
### Step 10: Clean String Values
Text data often has inconsistencies: extra spaces, mixed capitalization. Let's fix that!
```{python}
# Clean all string columns using method chaining
# - .str.strip() removes leading/trailing whitespace
# - .str.title() capitalizes first letter of each word
df_cleaned = (
df
.assign(**{
col: df[col].str.strip().str.title()
for col in string_columns
})
)
print("✅ String values cleaned!")
df_cleaned.head()
```
::: {.callout-note}
## What's Happening in `.assign()`?
The `assign()` method creates or modifies columns. The `**{...}` syntax unpacks a dictionary, letting us apply the same operation to multiple columns at once.
This is equivalent to:
```python
df['state'] = df['state'].str.strip().str.title()
df['gender'] = df['gender'].str.strip().str.title()
# ... and so on
```
:::
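Here is the same pattern as a runnable sketch on a two-row toy frame (the values are invented for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"state": ["  jonglei ", "UNITY"], "gender": [" male", "FEMALE "]})
text_cols = ["state", "gender"]

# The dict comprehension builds {column: cleaned Series};
# ** unpacks it so .assign() updates every listed column at once
cleaned = df_demo.assign(**{col: df_demo[col].str.strip().str.title() for col in text_cols})
print(cleaned)
```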
### Step 11a: Ensure Correct Data Types
Sometimes numeric columns get read as text. Let's ensure `population` is numeric.
```{python}
# Build a population column from the '2008' counts
# Rounding, then casting to the nullable 'Int64' type keeps any missing values as <NA>
df_cleaned = (
df_cleaned
.assign(population=lambda d: d["2008"].round().astype("Int64"))
)
print(f"✅ Population column type: {df_cleaned['population'].dtype}")
print(f"❓ Missing values after conversion: {df_cleaned['population'].isna().sum()}")
```
### Step 11b: Use the New `pd.col()` Expression Syntax
```{python}
# Same conversion using pd.col(), an expression-style helper available in newer pandas (3.0+)
# (on earlier pandas versions, use the lambda form from Step 11a instead)
df_pd_col = (
df_raw.assign(
population=pd.col("2008").round().astype("Int64")
)
)
print(f"✅ Population column type: {df_pd_col['population'].dtype}")
print(f"❓ Missing values after conversion: {df_pd_col['population'].isna().sum()}")
```
### Step 12a: Drop Missing Values and Remove Aggregated Rows
Before analysis, we need to:
1. **Remove rows with missing population values** — incomplete data skews results
2. **Filter out "Total" rows** — these are pre-calculated aggregates that cause double-counting
```{python}
# Check current shape before cleaning
print(f"📊 Before cleaning: {df_cleaned.shape[0]:,} rows")
print(f"❓ Rows with missing population: {df_cleaned['population'].isna().sum()}")
```
```{python}
# Preview what's in variable_name to understand the data
print("🔍 Unique values in variable_name:")
print(df_cleaned['variable_name'].unique())
```
**Method 1: Using `.dropna()` and `.query()` (Recommended)**
```{python}
# Drop NAs and filter out Total rows using method chaining
df_cleaned = (
df_cleaned
.dropna(subset=['population']) # Remove rows where population is NaN
.query("~variable_name.str.contains('Total', case=False)", engine='python')  # Remove Total rows (.str methods need the python engine)
.reset_index(drop=True)
)
print(f"✅ After cleaning: {df_cleaned.shape[0]:,} rows")
```
**Method 2: Using `.dropna()` and `.loc[]` (Alternative)**
```{python}
# Equivalent approach using .loc[] (produces the same result; shown for comparison)
dropna_df = (
df_cleaned
.dropna(subset=['population'])
.loc[lambda x: ~x['variable_name'].str.contains('Total', case=False)]
.reset_index(drop=True)
)
```
::: {.callout-tip}
## Why Remove "Total" Rows?
Census data often includes both **detail rows** (Male, Female) and **aggregated Total rows**.
If we keep Totals when calculating our own summaries:
```
Male: 100 + Female: 100 + Total: 200 = 400 ❌ (should be 200)
```
By removing Total rows first, we can calculate accurate statistics from scratch.
:::
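A three-row toy frame makes the double-counting concrete:

```python
import pandas as pd

toy = pd.DataFrame({
    "variable_name": ["Male", "Female", "Total"],
    "population": [100, 100, 200],
})

naive_sum = toy["population"].sum()  # the Total row is counted on top of its parts
correct_sum = toy.query("variable_name != 'Total'")["population"].sum()

print(naive_sum)    # inflated
print(correct_sum)  # accurate
```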
```{python}
# Verify only Male/Female remain
print("📝 Remaining values in variable_name:")
print(df_cleaned['variable_name'].unique())
```
::: {.callout-note}
## Variable Progression in This Tutorial
As we clean our data, we use descriptive variable names to track the state:
- `df_raw` → Original data loaded from URL
- `df` → Column names cleaned (snake_case)
- `df_cleaned` → String values cleaned + data types fixed
- `census_df` → Final dataset with extracted gender column (using method chaining)
Notice how we use **method chaining** to go directly from `df_cleaned` to `census_df` without intermediate copies—this is the power of pandas method chaining!
:::
### Step 12b: String Splitting - Extracting Gender Information
Now let's explore the `variable_name` column and extract meaningful information from it. This column contains structured text like "Population, Male (Number)" that we can split and parse.
```{python}
# Examine what's in the variable_name column
print("🔍 Unique values in variable_name column:")
print(df_cleaned['variable_name'].unique())
```
::: {.callout-tip}
## String Splitting Strategy
The `variable_name` column follows a pattern: "Category, Gender (Type)"
We want to extract the middle piece (Male or Female) using various pandas string methods. Each approach has different use cases!
:::
#### Method 1: Using `.str.split()` with Expand
The simplest approach: split on delimiter and select the piece you want.
```{python}
# Split on spaces and extract the second piece (index 1)
df_method_1a = (
df_cleaned
.assign(
# Split creates columns; we select the second one (index 1)
gender_extracted=lambda x: x['variable_name'].str.split(' ', expand=True).get(1)
)
)
print("✅ Method 1: Basic split with expand")
print(df_method_1a[['variable_name', 'gender_extracted']].drop_duplicates())
```
```{python}
# Split on spaces and extract the second piece (index 1)
df_method_1b = (
df_cleaned
.assign(
# Split creates columns; we select the second one (index 1)
gender_extracted=lambda x: x['variable_name'].str.split(' ', expand=True)[1]
)
)
print("✅ Method 1: Basic split with expand")
print(df_method_1b[['variable_name', 'gender_extracted']].drop_duplicates())
```
::: {.callout-note}
## How `.str.split()` Works
- `split(' ')` breaks the string at each occurrence of ' '
- `expand=True` creates separate columns (like a DataFrame)
- `.get(1)` or `[1]` selects the second column (Python uses 0-based indexing)
:::
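As a runnable illustration (the sample strings below mimic the "Category, Gender (Type)" pattern described above):

```python
import pandas as pd

s = pd.Series(["Population, Male (Number)", "Population, Female (Number)"])

# expand=True returns a DataFrame: one column per space-separated piece
parts = s.str.split(" ", expand=True)
print(parts)
print(parts[1].tolist())  # the second piece holds the gender
```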
#### Method 2: Chain Multiple Splits
Sometimes you need to clean up the extracted text further.
```{python}
# Split twice: first on spaces, then remove any parenthetical part
df_method_2 = (
df_cleaned
.assign(
# First split to get middle section
temp_extract=lambda x: x['variable_name'].str.split(' ', expand=True)[1],
# Then split on ' (' to remove the (Number) part
gender_clean=lambda x: x['temp_extract'].str.split(' \\(', expand=True)[0]
)
.drop(columns=['temp_extract']) # Remove temporary column
)
print("✅ Method 2: Chained splits for clean extraction")
print(df_method_2[['variable_name', 'gender_clean']].drop_duplicates())
```
#### Method 3: Using `.str.extract()` with Regex
For more complex patterns, regular expressions provide powerful extraction.
```{python}
# Use a regex pattern to capture the word that appears before ' ('
df_method_3 = (
df_cleaned
.assign(
gender_regex=lambda x: x['variable_name'].str.extract(r' (\w+) \(', expand=False)
)
)
print("✅ Method 3: Regex extraction")
print(df_method_3[['variable_name', 'gender_regex']].drop_duplicates())
```
::: {.callout-note}
## Understanding the Regex Pattern
`r' (\w+) \('` breaks down as:
- `r` = raw string (backslashes are passed through to the regex engine unchanged)
- ` ` = matches a literal space
- `(\w+)` = captures one or more word characters (letters/digits)
- ` \(` = matches literal " (" (backslash escapes the parenthesis)
The parentheses `()` create a "capture group" that `.extract()` returns.
:::
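The same pattern, demonstrated on sample strings shaped like the ones in this dataset:

```python
import pandas as pd

s = pd.Series(["Population, Male (Number)", "Population, Female (Number)"])

# (\w+) is the capture group: the word between a space and " ("
extracted = s.str.extract(r" (\w+) \(", expand=False)
print(extracted.tolist())
```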
#### Method 4: Using `.str.replace()` and String Slicing
Remove unwanted parts and extract what remains.
```{python}
# Remove the prefix and suffix to isolate the gender
df_method_4 = (
df_cleaned
.assign(
gender_replace=lambda x: (
x['variable_name']
.str.replace('Population, ', '', regex=False) # Remove prefix
.str.replace(r' \(Number\)', '', regex=True) # Remove suffix
)
)
)
print("✅ Method 4: String replacement")
print(df_method_4[['variable_name', 'gender_replace']].drop_duplicates())
```
#### Method 5: One-liner with Method Chaining
Combine everything into a compact, readable pipeline. This creates our final cleaned dataframe: `census_df`.
```{python}
# The most elegant solution: all-in-one chained extraction
# This is our final cleaned dataset with extracted gender information
census_df = (
df_cleaned
.assign(
gender=lambda x: (
x['variable_name']
.str.split(' ', expand=True)[1] # Extract middle piece
.str.split(r' \(', expand=True)[0] # Remove parenthetical (raw string avoids an invalid-escape warning)
.str.strip() # Remove any extra whitespace
)
)
)
print("✅ Method 5: Complete chained solution")
print(census_df[['variable_name', 'gender']].drop_duplicates())
# Verify the extraction worked
print("\n🎯 Unique gender values extracted:")
print(census_df['gender'].unique())
```
::: {.callout-tip}
## Which Method Should You Use?
**Choose based on your needs:**
| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Method 1** | Simple, consistent patterns | Fast, readable | Requires fixed structure |
| **Method 2** | Multi-step cleaning | Very clear logic | More verbose |
| **Method 3** | Complex patterns | Most flexible | Requires regex knowledge |
| **Method 4** | Known prefix/suffix | Intuitive | Less robust to variations |
| **Method 5** | Production code | Compact, professional | Harder for beginners |
For this tutorial, **Method 5** is ideal: it's clean, readable, and professional.
:::
### Step 13: Verify the Extraction
Let's confirm our extracted gender values match what we expect.
```{python}
# Check the distribution of gender values
gender_counts = (
census_df
.groupby('gender', as_index=False)
.agg(
count=('gender', 'size'),
total_population=('population', 'sum')
)
.sort_values('total_population', ascending=False)
)
print("📊 Gender distribution in extracted data:")
print(gender_counts)
```
### Step 14: Prepare Data for Analysis
We'll keep working with `census_df`, the cleaned dataframe with the extracted gender column, for all subsequent analyses:
```{python}
# Use the cleaned dataframe with extracted gender column for all subsequent analyses
print("✅ Ready for analysis!")
print(f"📋 Final dataset: {census_df.shape[0]:,} rows × {census_df.shape[1]} columns")
```
### Step 15: Final Cleaned Dataset Summary
```{python}
# Summary of our cleaned data with extracted gender column
print("=" * 50)
print("🎉 DATA CLEANING COMPLETE!")
print("=" * 50)
print(f"📋 Final shape: {census_df.shape[0]:,} rows × {census_df.shape[1]} columns")
print(f"📝 Columns: {census_df.columns.tolist()}")
print("\n🔍 Preview:")
census_df.head()
```
---
## Part 5: Filtering Data — Two Approaches
Before diving into analysis, let's learn how to **filter** (subset) data. Pandas offers two main approaches: `.loc[]` and `.query()`. Both are powerful—choose based on readability and context.
### Step 16: Filtering with `.loc[]`
The `.loc[]` accessor filters rows using **boolean conditions**. It's explicit and works well for simple filters.
```{python}
# Filter rows where gender is 'Male'
# .loc[condition] returns rows where condition is True
males_only = census_df.loc[census_df['gender'] == 'Male']
print(f"📊 Total rows: {len(census_df):,}")
print(f"👨 Male rows: {len(males_only):,}")
males_only.head()
```
Next, let's rename columns to friendlier names and keep only the ones we need:
```{python}
# Rename columns and keep the columns of interest
census_df = (
census_df
.rename(columns={
'region_name': 'state',
'age_name': 'age_category'
}
)
.query("age_category != 'Total'")
# select columns of interest
[['state', 'gender', 'age_category', 'population']]
)
# Inspect output
print(census_df)
```
```{python}
# Multiple conditions with .loc[]
# Use & (and), | (or), ~ (not)
# IMPORTANT: Wrap each condition in parentheses!
# Filter: Males in 'Jonglei' state
males_in_jonglei = census_df.loc[
((census_df['gender'] == 'Male') & (census_df['state'] == 'Jonglei'))
]
print(f"👨 Males in Jonglei: {len(males_in_jonglei):,} rows")
males_in_jonglei.head()
```
::: {.callout-note}
## `.loc[]` Syntax Rules
| Operation | Syntax |
|-----------|--------|
| Equals | `df['col'] == 'value'` |
| Not equals | `df['col'] != 'value'` |
| Greater than | `df['col'] > 100` |
| Multiple AND | `(cond1) & (cond2)` |
| Multiple OR | `(cond1) \| (cond2)` |
| NOT | `~(condition)` |
:::
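Here are the AND and NOT rules from the table, demonstrated on a small invented frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "state": ["Jonglei", "Unity", "Jonglei", "Warrap"],
    "gender": ["Male", "Male", "Female", "Female"],
    "population": [120, 80, 110, 90],
})

# AND: each condition in parentheses, joined with &
jonglei_males = df_demo.loc[(df_demo["state"] == "Jonglei") & (df_demo["gender"] == "Male")]

# NOT: ~ flips the boolean mask
not_jonglei = df_demo.loc[~(df_demo["state"] == "Jonglei")]

print(len(jonglei_males), len(not_jonglei))
```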
### Step 17: Filtering with `.query()`
The `.query()` method uses a **string expression**—often more readable, especially for complex filters.
```{python}
# Same filter using .query() - cleaner syntax!
males_only_query = census_df.query("gender == 'Male'")
print(f"👨 Male rows (via query): {len(males_only_query):,}")
males_only_query.head()
```
```{python}
# Multiple conditions with .query() - much cleaner!
# No parentheses needed, use 'and', 'or', 'not'
males_in_jonglei_query = census_df.query("gender == 'Male' and state == 'Jonglei'")
print(f"👨 Males in Jonglei (via query): {len(males_in_jonglei_query):,} rows")
males_in_jonglei_query.head()
```
### Step 18: Using Variables in `.query()` with `@`
A powerful feature: reference Python variables inside query strings using `@`.
```{python}
# Define filter criteria as variables
target_state = 'Upper Nile'
min_population = 1000
# Use @ to reference variables inside the query string
filtered_df = census_df.query("state == @target_state and population >= @min_population")
print(f"🎯 Rows matching criteria: {len(filtered_df):,}")
filtered_df.head()
```
```{python}
# Filter using a list of values with 'in'
states_of_interest = ['Jonglei', 'Upper Nile', 'Unity']
# .loc approach - use .isin()
result_loc = census_df.loc[census_df['state'].isin(states_of_interest)]
# .query approach - use 'in @list'
result_query = census_df.query("state in @states_of_interest")
print(f"📊 Results match: {len(result_loc) == len(result_query)}")
print(f"📋 Rows found: {len(result_query):,}")
```
::: {.callout-tip}
## When to Use Each Method
| Use `.loc[]` When... | Use `.query()` When... |
|---------------------|----------------------|
| Simple, single conditions | Complex, multi-condition filters |
| You need to select specific columns too | Readability is priority |
| Working with column names that have spaces | Filter logic reads like English |
| Programmatic/dynamic column selection | Using variables with `@` |
**In method chains**, `.query()` is often cleaner:
```python
# Clean and readable
result = (
df
.query("state == 'Jonglei' and gender == 'Male'")
.groupby('age_category')
.agg(population=('population', 'sum'))
)
```
:::
### Step 19: Combining `.loc[]` with Column Selection
A unique advantage of `.loc[]`: select rows AND columns simultaneously.
```{python}
# Select specific rows AND columns with .loc[rows, columns]
subset = census_df.loc[
census_df['state'] == 'Jonglei', # Row condition
['state', 'gender', 'population'] # Columns to keep
]
print("📊 Filtered rows with selected columns:")
subset.head()
```
```{python}
# Equivalent with .query() + column selection
subset_query = (
census_df
.query("state == 'Jonglei'")
[['state', 'gender', 'population']] # Column selection after query
)
subset_query.head()
```
---
## Part 6: Summary Statistics
Now that our data is clean, let's calculate some high-level metrics.
### Step 20: Calculate Overview Metrics
```{python}
# Build a summary table of key statistics
summary_stats = pd.DataFrame({
'metric': [
'Total Population',
'Number of States',
'Age Categories',
'Gender Groups'
],
'value': [
census_df['population'].sum(),
census_df['state'].nunique(),
census_df['age_category'].nunique(),
census_df['gender'].nunique()
]
})
summary_stats
```
### Step 21: Display as Formatted Table
Let's make this look professional with Great Tables.
```{python}
(
GT(summary_stats)
.tab_header(
title="South Sudan 2008 Census Overview",
subtitle="Key Summary Statistics"
)
.fmt_number(
columns="value",
rows=[0], # Only format the first row (population)
decimals=0,
use_seps=True
)
.cols_label(
metric="Metric",
value="Value"
)
.cols_align(align="center")
)
```
### Step 22a: Age Category Cleaning
#### Method 1: Using .replace() with a mapping dictionary
```{python}
# Define the age category mapping (matching actual data format: '0 To 4')
age_mapping = {
'0 To 4': '0-14',
'5 To 9': '0-14',
'10 To 14': '0-14',
'15 To 19': '15-24',
'20 To 24': '15-24',
'25 To 29': '25-34',
'30 To 34': '25-34',
'35 To 39': '35-44',
'40 To 44': '35-44',
'45 To 49': '45-54',
'50 To 54': '45-54',
'55 To 59': '55-64',
'60 To 64': '55-64',
'65+': '65+' # Already in correct format
}
# Apply the mapping using .replace()
census_new_age_categories = (
census_df
.assign(
age_category=lambda x: x['age_category'].replace(age_mapping)
)
)
print("✅ Age categories standardized!")
print("\n🎯 New age categories:")
print(sorted(census_new_age_categories['age_category'].unique()))
```
#### Method 2: Using case_when() (Pandas 2.2+)
```{python}
## Using case_when for conditional transformations
# case_when is called on a Series instance, not as a class method
census_clean = (
census_df
.assign(
age_category=lambda x: x['age_category'].case_when(
caselist=[
(x['age_category'].isin(['0 To 4', '5 To 9', '10 To 14']), '0-14'),
(x['age_category'].isin(['15 To 19', '20 To 24']), '15-24'),
(x['age_category'].isin(['25 To 29', '30 To 34']), '25-34'),
(x['age_category'].isin(['35 To 39', '40 To 44']), '35-44'),
(x['age_category'].isin(['45 To 49', '50 To 54']), '45-54'),
(x['age_category'].isin(['55 To 59', '60 To 64']), '55-64'),
(x['age_category'] == '65+', '65+'),
]
)
)
)
print("✅ Age categories standardized using case_when!")
print("\n🎯 New age categories:")
print(sorted(census_clean['age_category'].unique()))
```
---
## Part 7: Gender Analysis
Let's explore how the population breaks down by gender.
### Step 22b: Calculate National Gender Distribution
This demonstrates a complete method chaining pipeline:
```{python}
# Aggregate population by gender
# This pipeline: groups → sums → sorts → calculates percentage
gender_distribution = (
census_clean
.groupby('gender', as_index=False) # Group rows by gender
.agg(population=('population', 'sum')) # Sum population in each group
.sort_values('population', ascending=False) # Sort largest first
.assign(
percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2)
)
.reset_index(drop=True) # Clean up the index
)
gender_distribution
```
::: {.callout-note}
## Breaking Down the Pipeline
| Step | What It Does |
|------|--------------|
| `.groupby('gender')` | Creates groups: one for Male, one for Female |
| `.agg(population=('population', 'sum'))` | Sums the population column within each group |
| `.sort_values(...)` | Reorders rows by population (descending) |
| `.assign(percentage=...)` | Adds a new column with calculated percentages |
| `.reset_index(drop=True)` | Creates a clean 0, 1, 2... index |
:::
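The same five-step pipeline, run end-to-end on a four-row toy frame so you can trace each step's output:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "population": [60, 40, 30, 70],
})

dist = (
    toy
    .groupby("gender", as_index=False)                       # one group per gender
    .agg(population=("population", "sum"))                   # Male: 90, Female: 110
    .sort_values("population", ascending=False)              # largest group first
    .assign(percentage=lambda x: (x["population"] / x["population"].sum() * 100).round(2))
    .reset_index(drop=True)
)
print(dist)
```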
### Step 23: Display Gender Table
```{python}
(
GT(gender_distribution)
.tab_header(
title="National Gender Distribution",
subtitle="Population and Percentages"
)
.fmt_number(columns="population", decimals=0, use_seps=True)
.fmt_number(columns="percentage", decimals=2)
.cols_label(
gender="Gender",
population="Population",
percentage="Percentage (%)"
)
)
```
### Step 24: Visualize with Interactive Donut Chart
Interactive charts let viewers explore data themselves—hover, click, zoom!
```{python}
#| fig-width: 10
#| fig-height: 6
# Create a donut chart (pie chart with a hole)
fig_gender = px.pie(
gender_distribution,
values='population',
names='gender',
title='Gender Distribution Across South Sudan',
color='gender',
color_discrete_map={'Male': '#22d3ee', 'Female': '#FFD700'},
hole=0.4 # Creates the donut hole
)
# Customize appearance
fig_gender.update_traces(
textposition='inside',
textinfo='percent+label',
marker=dict(line=dict(color='#ffffff', width=2))
)
fig_gender.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
height=500,
legend=dict(
orientation="h",
yanchor="bottom",
y=-0.15,
xanchor="center",
x=0.5
)
)
fig_gender.show()
```
---
## Part 8: State-Level Gender Analysis
Now let's dig deeper—how does gender distribution vary by state?
### Step 25: Aggregate by State and Gender
```{python}
# Group by both state AND gender
state_gender_long = (
census_clean
.groupby(['state', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
)
print("📊 Long format (each state-gender combo is one row):")
state_gender_long.head(10)
```
### Step 26: Pivot to Wide Format
Sometimes we need data in "wide" format—one row per state, separate columns for each gender.
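Before pivoting the census data, here is a toy round trip between long and wide format, with `melt()` as the inverse of `pivot()` (the numbers are invented):

```python
import pandas as pd

long_df = pd.DataFrame({
    "state": ["Jonglei", "Jonglei", "Unity", "Unity"],
    "gender": ["Male", "Female", "Male", "Female"],
    "population": [120, 110, 80, 90],
})

# Long -> wide: one row per state, one column per gender
wide = long_df.pivot(index="state", columns="gender", values="population").reset_index()

# Wide -> long again: melt stacks the gender columns back into rows
back = wide.melt(id_vars="state", var_name="gender", value_name="population")

print(wide)
```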
```{python}
# Pivot: rows become states, columns become genders
state_gender_wide = (
state_gender_long
.pivot(index='state', columns='gender', values='population')
.reset_index()
# .clean_names() # Standardize the new column names
.rename(columns=lambda c: c.lower())
)
print("📊 Wide format (one row per state):")
state_gender_wide.head()
```
### Step 27: Calculate Derived Metrics
Now we can calculate useful metrics like gender ratio.
```{python}
# Add calculated columns
state_gender = (
state_gender_wide
.assign(
total=lambda x: x['male'] + x['female'],
male_pct=lambda x: (x['male'] / x['total'] * 100).round(2),
female_pct=lambda x: (x['female'] / x['total'] * 100).round(2),
gender_ratio=lambda x: (x['male'] / x['female'] * 100).round(2) # Males per 100 females
)
.sort_values('total', ascending=False)
.reset_index(drop=True)
)
print("📊 State-level gender analysis:")
state_gender.head(10)
```
::: {.callout-note}
## Understanding Gender Ratio
A gender ratio of 100 means equal males and females.
- Above 100 = more males
- Below 100 = more females
:::
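The formula in plain arithmetic, with invented counts:

```python
# Males per 100 females: the same formula used for the gender_ratio column
male, female = 110, 100
gender_ratio = round(male / female * 100, 2)
print(gender_ratio)  # above 100, so more males than females
```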
### Step 28: Display Top 5 States Table
```{python}
top_5_states = state_gender.head(5)
(
GT(top_5_states)
.tab_header(
title="Top 5 States by Population",
subtitle="Gender Breakdown and Ratios"
)
.fmt_number(columns=['male', 'female', 'total'], decimals=0, use_seps=True)
.fmt_number(columns=['male_pct', 'female_pct', 'gender_ratio'], decimals=2)
.cols_label(
state="State",
male="Male",
female="Female",
total="Total",
male_pct="Male %",
female_pct="Female %",
gender_ratio="Ratio (M/100F)"
)
.data_color(
columns="gender_ratio",
palette=["#FFD700", "#ffffff", "#22d3ee"]
)
)
```
### Step 29: Visualize State Gender Distribution
```{python}
#| fig-width: 12
#| fig-height: 8
# Reshape data for plotting (back to long format)
state_gender_plot = (
state_gender
.head(5)
.melt(
id_vars=['state'],
value_vars=['male', 'female'],
var_name='gender',
value_name='population'
)
.assign(gender=lambda x: x['gender'].str.title()) # Capitalize for display
)
# Create grouped bar chart
fig_state_gender = px.bar(
state_gender_plot,
x='state',
y='population',
color='gender',
title='Population by State and Gender (Top 5 States)',
labels={'population': 'Population', 'state': 'State'},
color_discrete_map={'Male': '#22d3ee', 'Female': '#FFD700'},
barmode='group'
)
fig_state_gender.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
xaxis_tickangle=-45,
height=600,
hovermode='x unified',
legend=dict(
title="Gender",
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1
)
)
fig_state_gender.update_traces(
hovertemplate='<b>%{x}</b><br>Population: %{y:,}<extra></extra>'
)
fig_state_gender.show()
```
---
## Part 9: Age Category Analysis
### Step 30: Calculate National Age Distribution
```{python}
# Aggregate by age category
age_distribution = (
census_clean
.groupby('age_category', as_index=False)
.agg(population=('population', 'sum'))
.sort_values('population', ascending=False)
.assign(
percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2)
)
.reset_index(drop=True)
)
age_distribution
```
### Step 31: Display Age Distribution Table
```{python}
(
GT(age_distribution)
.tab_header(
title="Population Distribution by Age Category",
subtitle="National Summary"
)
.fmt_number(columns="population", decimals=0, use_seps=True)
.fmt_number(columns="percentage", decimals=2)
.cols_label(
age_category="Age Category",
population="Population",
percentage="Percentage (%)"
)
.data_color(
columns="percentage",
palette=["#000000", "#0891b2", "#22d3ee", "#FFD700"]
)
)
```
### Step 32: Visualize Age Distribution
```{python}
#| fig-width: 12
#| fig-height: 6
# Create horizontal bar chart (easier to read category names)
fig_age = px.bar(
age_distribution.sort_values('population'), # Sort for visual clarity
x='population',
y='age_category',
orientation='h',
title='Population Distribution by Age Category',
labels={'population': 'Population', 'age_category': 'Age Category'},
color='population',
color_continuous_scale=['#000000', '#22d3ee', '#FFD700']
)
fig_age.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
height=500,
showlegend=False
)
fig_age.update_traces(
hovertemplate='<b>%{y}</b><br>Population: %{x:,}<extra></extra>'
)
fig_age.show()
```
---
## Part 10: Age and Gender Interaction
The population pyramid is a classic demographic visualization.
### Step 33: Prepare Age-Gender Cross-Tabulation
```{python}
# Aggregate by both age and gender
age_gender_long = (
census_clean
.groupby(['age_category', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
)
# Pivot and calculate metrics
age_gender = (
age_gender_long
.pivot(index='age_category', columns='gender', values='population')
.reset_index()
# .clean_names()
.rename(columns=lambda c: c.lower())
.assign(
total=lambda x: x['male'] + x['female'],
female_pct=lambda x: (x['female'] / x['total'] * 100).round(2)
)
.sort_values('total', ascending=False)
.reset_index(drop=True)
)
age_gender
```
### Step 34: Create Population Pyramid
A population pyramid shows males on the left, females on the right—a powerful way to visualize demographic structure.
```{python}
#| fig-width: 12
#| fig-height: 7
# Prepare data: males get negative values (to extend left)
pyramid_data = (
    age_gender_long
    .assign(
        # Vectorized: keep female counts, negate male counts (faster than row-wise .apply)
        population_adjusted=lambda x: x['population'].where(
            x['gender'] != 'Male', -x['population']
        )
    )
)
# Separate by gender
male_pyramid = pyramid_data.query("gender == 'Male'")
female_pyramid = pyramid_data.query("gender == 'Female'")
# Build the pyramid
fig_pyramid = go.Figure()
# Male bars (left side)
fig_pyramid.add_trace(
go.Bar(
y=male_pyramid['age_category'],
x=male_pyramid['population_adjusted'],
name='Male',
orientation='h',
marker=dict(color='#22d3ee'),
hovertemplate='<b>Male - %{y}</b><br>Population: %{customdata:,}<extra></extra>',
customdata=male_pyramid['population']
)
)
# Female bars (right side)
fig_pyramid.add_trace(
go.Bar(
y=female_pyramid['age_category'],
x=female_pyramid['population'],
name='Female',
orientation='h',
marker=dict(color='#FFD700'),
hovertemplate='<b>Female - %{y}</b><br>Population: %{x:,}<extra></extra>'
)
)
fig_pyramid.update_layout(
title='Population Pyramid: Age and Gender Distribution',
title_font_size=18,
title_font_color="#06b6d4",
xaxis=dict(
title='Population',
tickvals=[-400000, -300000, -200000, -100000, 0, 100000, 200000, 300000, 400000],
ticktext=['400K', '300K', '200K', '100K', '0', '100K', '200K', '300K', '400K']
),
yaxis=dict(title='Age Category'),
barmode='overlay',
height=600,
bargap=0.1,
legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1
)
)
fig_pyramid.show()
```
---
## Part 11: State Rankings
### Step 35: Create State Population Rankings
```{python}
# Calculate comprehensive state rankings
state_ranking = (
census_clean
.groupby('state', as_index=False)
.agg(population=('population', 'sum'))
.sort_values('population', ascending=False)
.assign(
rank=lambda x: range(1, len(x) + 1),
percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2),
cumulative_pct=lambda x: x['percentage'].cumsum().round(2)
)
.reset_index(drop=True)
)
state_ranking.head(5)
```
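The `cumulative_pct` column comes from `.cumsum()`, which produces a running total down the sorted shares. A tiny illustration with made-up percentages:

```python
import pandas as pd

# Hypothetical shares, already sorted in descending order and summing to 100
shares = pd.Series([40.0, 25.0, 20.0, 10.0, 5.0])

# Running total: after the top two groups we have covered 65% of the whole
cumulative = shares.cumsum()
print(cumulative.tolist())  # [40.0, 65.0, 85.0, 95.0, 100.0]
```

Reading the cumulative column tells you how many top states you need to cover, say, 80% of the population.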
### Step 36: Display Top 5 States Table
```{python}
top_5 = state_ranking.head(5)
(
GT(top_5)
.tab_header(
title="Top 5 Most Populous States",
subtitle="Population, Share, and Cumulative Distribution"
)
.fmt_number(columns="population", decimals=0, use_seps=True)
.fmt_number(columns=["percentage", "cumulative_pct"], decimals=2)
.cols_label(
rank="Rank",
state="State",
population="Population",
percentage="% of Total",
cumulative_pct="Cumulative %"
)
.data_color(
columns="population",
palette=["#000000", "#0891b2", "#22d3ee", "#FFD700"]
)
)
```
### Step 37: Visualize with Treemap
Treemaps show hierarchical data as nested rectangles—size represents magnitude.
```{python}
#| fig-width: 10
#| fig-height: 6
fig_treemap = px.treemap(
top_5,
path=['state'],
values='population',
    title='State Population Distribution (Top 5)',
color='population',
color_continuous_scale=['#000000', '#0891b2', '#22d3ee', '#FFD700']
)
fig_treemap.update_traces(
textposition="middle center",
texttemplate="<b>%{label}</b><br>%{value:,}",
hovertemplate='<b>%{label}</b><br>Population: %{value:,}<br>Share: %{percentRoot:.2%}<extra></extra>'
)
fig_treemap.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
height=600
)
fig_treemap.show()
```
---
## Part 12: State-Level Age Distribution
### Step 38: Analyze Age in Top 5 States
```{python}
# Get top 5 states by population
top_5_states = state_ranking.head(5)['state'].tolist()
print(f"📍 Top 5 states: {top_5_states}")
```
```{python}
# Calculate age distribution within each top state
state_age_distribution = (
census_clean
.query("state in @top_5_states") # Filter to top 5 only
.groupby(['state', 'age_category'], as_index=False)
.agg(population=('population', 'sum'))
.assign(
percentage=lambda x: (
x.groupby('state')['population']
.transform(lambda pop: pop / pop.sum() * 100)
.round(2)
)
)
)
state_age_distribution.head(10)
```
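The `.transform()` call above is doing the heavy lifting. Unlike `.agg()`, which collapses each group to a single row, `.transform()` returns a result aligned with the original rows, so every row receives its own within-group percentage. A small sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "population": [30, 70, 50, 150],
})

# transform() broadcasts each group's sum back to every row of that group,
# so each row's population is divided by its OWN state's total
df["pct"] = (
    df.groupby("state")["population"]
    .transform(lambda pop: pop / pop.sum() * 100)
    .round(2)
)
print(df)  # pct: 30.0, 70.0 for state A; 25.0, 75.0 for state B
```

If we had used `.agg()` here instead, we would have gotten two rows (one per state) and lost the row-level alignment needed for the percentage column.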
::: {.callout-tip}
## The `query()` Method
`query("state in @top_5_states")` is a readable way to filter data. The `@` symbol references a Python variable inside the query string.
This is equivalent to:
```python
df[df['state'].isin(top_5_states)]
```
:::
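To see that equivalence in action, here's a self-contained sketch with a toy DataFrame (the state names are just examples):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Jonglei", "Unity", "Lakes"],
    "population": [100, 80, 60],
})
keep = ["Jonglei", "Lakes"]

via_query = df.query("state in @keep")   # @ pulls in the Python list
via_isin = df[df["state"].isin(keep)]    # explicit boolean mask

# Both approaches return exactly the same rows
assert via_query.equals(via_isin)
```

`.query()` tends to read better in long method chains, while the `.isin()` mask is more explicit; both are idiomatic pandas.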
### Step 39: Visualize State Age Distribution
```{python}
#| fig-width: 12
#| fig-height: 8
fig_state_age = px.bar(
state_age_distribution,
x='state',
y='population',
color='age_category',
title='Age Distribution in Top 5 States',
labels={
'population': 'Population',
'state': 'State',
'age_category': 'Age Category'
},
text='percentage',
color_discrete_sequence=['#000000', '#0891b2', '#22d3ee', '#06b6d4', '#FFD700', '#f59e0b']
)
fig_state_age.update_traces(
texttemplate='%{text:.1f}%',
textposition='inside',
hovertemplate='<b>%{x}</b><br>%{fullData.name}<br>Population: %{y:,}<br>Percentage: %{text:.2f}%<extra></extra>'
)
fig_state_age.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
xaxis_tickangle=-45,
height=600,
barmode='stack',
legend=dict(
title="Age Category",
orientation="v",
yanchor="middle",
y=0.5,
xanchor="left",
x=1.02
)
)
fig_state_age.show()
```
---
## Part 13: Comprehensive Dashboard
Let's combine everything into a single, multi-panel dashboard.
### Step 40: Build Multi-Panel Dashboard
```{python}
#| fig-width: 14
#| fig-height: 10
# Filter to top 5 states
top_5_df = census_clean.query("state in @top_5_states")
# Create 2x2 subplot grid
fig_dashboard = make_subplots(
rows=2, cols=2,
subplot_titles=(
'Total Population by State',
'Gender Distribution',
'Age Distribution',
'Gender Ratio (Males per 100 Females)'
),
specs=[
[{"type": "bar"}, {"type": "bar"}],
[{"type": "bar"}, {"type": "bar"}]
],
vertical_spacing=0.12,
horizontal_spacing=0.1
)
# ----- Panel 1: Total Population -----
panel1_data = (
top_5_df
.groupby('state', as_index=False)
.agg(population=('population', 'sum'))
.sort_values('population', ascending=False)
)
fig_dashboard.add_trace(
go.Bar(
x=panel1_data['state'],
y=panel1_data['population'],
marker_color='#22d3ee',
name='Total Pop',
showlegend=False
),
row=1, col=1
)
# ----- Panel 2: Gender Distribution -----
panel2_data = (
top_5_df
.groupby(['state', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
)
for gender, color in [('Male', '#22d3ee'), ('Female', '#FFD700')]:
gender_data = panel2_data.query("gender == @gender")
fig_dashboard.add_trace(
go.Bar(
x=gender_data['state'],
y=gender_data['population'],
name=gender,
marker_color=color
),
row=1, col=2
)
# ----- Panel 3: Age Distribution (Stacked) -----
panel3_data = (
top_5_df
.groupby(['state', 'age_category'], as_index=False)
.agg(population=('population', 'sum'))
)
age_categories = panel3_data['age_category'].unique()
colors = ['#000000', '#0891b2', '#22d3ee', '#06b6d4', '#FFD700', '#f59e0b']
for i, age_cat in enumerate(age_categories):  # include every age category
age_data = panel3_data.query("age_category == @age_cat")
fig_dashboard.add_trace(
go.Bar(
x=age_data['state'],
y=age_data['population'],
name=age_cat,
marker_color=colors[i % len(colors)],
showlegend=False
),
row=2, col=1
)
# ----- Panel 4: Gender Ratio -----
panel4_data = (
top_5_df
.groupby(['state', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
.pivot(index='state', columns='gender', values='population')
.reset_index()
    .rename(columns=lambda c: c.lower())  # lowercase 'Male'/'Female' labels; clean_names() would also work
.assign(
ratio=lambda x: (x['male'] / x['female'] * 100).round(2)
)
)
fig_dashboard.add_trace(
go.Bar(
x=panel4_data['state'],
y=panel4_data['ratio'],
marker_color='#FFD700',
name='Gender Ratio',
showlegend=False
),
row=2, col=2
)
# ----- Layout -----
fig_dashboard.update_layout(
title_text="Comprehensive Demographic Dashboard: Top 5 States",
title_font_size=20,
title_font_color="#06b6d4",
height=900,
showlegend=True,
legend=dict(
orientation="h",
yanchor="bottom",
y=-0.15,
xanchor="center",
x=0.5
)
)
fig_dashboard.update_xaxes(tickangle=-45)
fig_dashboard.update_yaxes(title_text="Population", row=1, col=1)
fig_dashboard.update_yaxes(title_text="Population", row=1, col=2)
fig_dashboard.update_yaxes(title_text="Population", row=2, col=1)
fig_dashboard.update_yaxes(title_text="Ratio", row=2, col=2)
fig_dashboard.show()
```
---
## Key Insights
::: {.callout-important}
## What the Data Tells Us
### 1. Population Concentration
The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.
### 2. Youth Demographics
The age distribution reveals a young population—typical of developing nations. This "youth bulge" represents both opportunity (workforce potential) and challenge (education, employment needs).
### 3. Gender Balance
Most states show relatively balanced gender distributions, with some variation that may reflect migration patterns, conflict impacts, or data collection methodology.
### 4. Regional Disparities
Substantial population differences between states suggest the need for differentiated development strategies rather than uniform national approaches.
:::
---
## Conclusion
Congratulations! You've completed a full demographic analysis using professional data science techniques.
**What You Learned:**
✅ Loading data from URLs with `pd.read_csv()`
✅ Exploring data structure with `.head()`, `.dtypes`, `.isna()`
✅ Cleaning data with method chaining and pyjanitor
✅ **Filtering with `.loc[]`** — explicit boolean conditions
✅ **Filtering with `.query()`** — readable string expressions
✅ Aggregating with `.groupby()` and `.agg()`
✅ Reshaping with `.pivot()` and `.melt()`
✅ Creating interactive visualizations with Plotly
✅ Building multi-panel dashboards
**The techniques you've learned apply to ANY dataset**—from business analytics to scientific research.
::: {.callout-tip}
## Next Steps
1. Download the code and try it with your own data
2. Experiment with different Plotly chart types
3. Explore pandas' full method chaining capabilities
4. Practice! The best way to learn is by doing.
:::
---
## Technical Reference
### Libraries Used
| Library | Version | Purpose |
|---------|---------|---------|
| **pandas** | 2.0+ | Data manipulation with method chaining |
| **pyjanitor** | 0.26+ | Clean column names and data cleaning helpers |
| **Plotly** | 5.17+ | Interactive, web-ready visualizations |
| **Great Tables** | 0.8+ | Professional table formatting |
### Key Techniques Demonstrated
| Technique | Description |
|-----------|-------------|
| **Method Chaining** | Connect operations with `.` for readable pipelines |
| **`clean_names()`** | Standardize column names to snake_case |
| **Lambda Functions** | Inline calculations within `assign()` |
| **`.loc[]` Filtering** | Explicit boolean conditions for row selection |
| **`.query()` Filtering** | Readable string expressions with `@` variable support |
| **`groupby().agg()`** | Flexible aggregation with named outputs |
| **`pivot()` / `melt()`** | Reshape between long and wide formats |
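Most of these techniques compose into a single pipeline. Here's a minimal, self-contained sketch (toy data, with a plain `rename` standing in for pyjanitor's `clean_names()`) that chains renaming, filtering, aggregation, and sorting:

```python
import pandas as pd

# Toy raw data with messy column names (hypothetical, not the census)
raw = pd.DataFrame({
    "State Name": ["A", "A", "B"],
    "Gender": ["Male", "Female", "Male"],
    "Population": [10, 12, 20],
})

summary = (
    raw
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # snake_case, like clean_names()
    .query("gender == 'Male'")                    # filter with a readable string expression
    .groupby("state_name", as_index=False)
    .agg(total=("population", "sum"))             # named aggregation
    .sort_values("total", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Each step receives the output of the one before it, so the whole transformation reads top to bottom like a recipe, with no throwaway intermediate variables.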
---
*Tutorial by Alierwai Reng, Founder of PyStatR+. For data science training or collaboration, reach out at info@pystatrplus.org*
---
::: {.callout-tip icon="false"}
## About the Author
Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator and analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in "Education from the Heart."
**For training, consulting, or collaboration opportunities:**
📧 info@pystatrplus.org
🌐 [pystatrplus.org](https://pystatrplus.org)
:::
::: {.callout-note icon="false"}
## Editor's Note
This tutorial reflects PyStatR+'s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.
At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.
**PyStatR+**: *Learning Simplified. Communication Amplified.*
:::