Interactive Demographic Explorer: South Sudan 2008 Census

A Beginner’s Guide to Data Analysis with Pandas Method Chaining and Plotly

Python
Pandas
Interactive Visualization
Demographics
Census
Learn data analysis step-by-step! This beginner-friendly tutorial explores South Sudan’s demographics using pandas method chaining for clean, readable code and Plotly for interactive visualizations.
Author

Alierwai Reng

Published

November 20, 2024

Introduction

Welcome to this hands-on data analysis tutorial! By the end of this guide, you’ll understand how to:

  • Load and explore real-world census data
  • Clean data using pandas method chaining (a professional, readable approach)
  • Calculate summary statistics and group-level metrics
  • Create interactive visualizations that invite exploration

We’ll analyze South Sudan’s 2008 census data, but the techniques you learn apply to any dataset.

Tip: What is Method Chaining?

Method chaining connects multiple operations in a single, flowing statement—like a recipe where each step feeds into the next. Instead of creating many intermediate variables, we write clean, readable pipelines.
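As a quick illustration on toy data (not the census dataset we use below), here is one chained pipeline that cleans a column, filters rows, and tidies the index in a single statement:

```python
import pandas as pd

# Toy data for illustration only
scores = pd.DataFrame({"name": ["  ana ", "BOB"], "score": [82, 95]})

result = (
    scores
    .assign(name=lambda d: d["name"].str.strip().str.title())  # tidy the names
    .query("score >= 90")                                      # keep high scores
    .reset_index(drop=True)                                    # clean 0-based index
)

print(result)  # one row: Bob, 95
```

Each dot feeds the previous result into the next step, so no intermediate variables are needed.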


Part 1: Environment Setup

Step 1: Import Required Libraries

Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.

Code
# Data manipulation
import pandas as pd          # The core library for data analysis
import janitor               # Adds helper functions like clean_names()

# Visualization
import plotly.express as px           # Quick, beautiful charts
import plotly.graph_objects as go     # Advanced chart customization
from plotly.subplots import make_subplots  # Multi-panel dashboards

# Tables
from great_tables import GT   # Professional table formatting

print("✅ All libraries loaded successfully!")
print(f"1) 📊 Pandas version: {pd.__version__};\n2) janitor version: {janitor.__version__}")
✅ All libraries loaded successfully!
1) 📊 Pandas version: 3.0.0rc1;
2) janitor version: 0.32.8
Note: Library Purposes
Library What It Does
pandas Load, clean, transform, and analyze data
janitor Simplifies common cleaning tasks (like fixing column names)
plotly Creates interactive, web-ready charts
great_tables Makes publication-quality tables

Step 2: Configure a Custom Theme

Consistent styling makes your visualizations look professional. We’ll create a PyStatR+ branded theme.

Code
import plotly.io as pio

# Define custom color palette and fonts
pio.templates["pystatr"] = go.layout.Template(
    layout=go.Layout(
        font=dict(family="Poppins, sans-serif", size=12),
        plot_bgcolor="#ffffff",
        paper_bgcolor="#ffffff",
        colorway=["#22d3ee", "#FFD700", "#06b6d4", "#0891b2", "#0e7490"],
        title_font=dict(size=18, color="#06b6d4", family="Poppins"),
    )
)

# Set as default for all charts
pio.templates.default = "pystatr"

print("🎨 Custom theme configured!")
🎨 Custom theme configured!

Part 2: Loading the Data

Step 3: Load Census Data from URL

Real datasets often live on the web. Pandas can read directly from URLs—no download required!

Code
# Define the data source URL
# Storing it in a module-level constant (ALL_CAPS) makes it easy to find and update later
DATA_URL = (
    "https://raw.githubusercontent.com/tongakuot/r_tutorials/"
    "main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"
)

# Load the data into a DataFrame
# We name it 'df_raw' to indicate this is the original, uncleaned data
df_raw = pd.read_csv(DATA_URL)

# Confirm successful load with basic info
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
✅ Data loaded successfully!
📋 Rows: 453
📋 Columns: 10
Tip: Naming Convention (snake_case)

We use snake_case (lowercase words separated by underscores) for all variable names. This is the Python standard and makes code more readable.

  • ✅ Good: df_raw, gender_distribution, top_10_states
  • ❌ Avoid: dfRaw, GenderDistribution, Top10States

Part 3: Data Exploration

Before cleaning, we must understand our data. What columns exist? What types are they? Are there issues?

Step 4: Examine Column Names

Code
# View the original column names
print("📝 Original column names:")
print(df_raw.columns.tolist())
📝 Original column names:
['Region', 'Region Name', 'Region - RegionId', 'Variable', 'Variable Name', 'Age', 'Age Name', 'Scale', 'Units', '2008']

Notice any issues? Inconsistent capitalization, spaces, or special characters make columns harder to work with.

Step 5: Check Data Types

Code
# Check what type of data each column contains
print("🔍 Data types:")
print(df_raw.dtypes)
🔍 Data types:
Region                   str
Region Name              str
Region - RegionId        str
Variable                 str
Variable Name            str
Age                      str
Age Name                 str
Scale                    str
Units                    str
2008                 float64
dtype: object
Note: Understanding Data Types
Type Meaning Examples
int64 Whole numbers 1, 42, 1000
float64 Decimal numbers 3.14, 99.9
str / object Text (strings) “Male”, “Juba”

(Pandas 3.x reports text columns as str; older versions show object.)
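You can watch pandas infer these types on a tiny frame of your own (an illustrative example; note the text column shows as object on pandas 2.x and as str on pandas 3.x):

```python
import pandas as pd

# Toy frame: one integer, one float, and one text column
demo = pd.DataFrame({
    "count": [1, 42],
    "ratio": [3.14, 99.9],
    "city": ["Juba", "Wau"],
})

# count -> int64, ratio -> float64, city -> object (or str on pandas 3.x)
print(demo.dtypes)
```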

Step 6: Preview the Data

Code
# View the first 5 rows
# This gives us a concrete look at actual values
df_raw.head()
Region Region Name Region - RegionId Variable Variable Name Age Age Name Scale Units 2008
0 KN.A2 Upper Nile SS-NU KN.B2 Population, Total (Number) KN.C1 Total units Persons 964353.0
1 KN.A2 Upper Nile SS-NU KN.B2 Population, Total (Number) KN.C2 0 to 4 units Persons 150872.0
2 KN.A2 Upper Nile SS-NU KN.B2 Population, Total (Number) KN.C3 5 to 9 units Persons 151467.0
3 KN.A2 Upper Nile SS-NU KN.B2 Population, Total (Number) KN.C4 10 to 14 units Persons 126140.0
4 KN.A2 Upper Nile SS-NU KN.B2 Population, Total (Number) KN.C5 15 to 19 units Persons 103804.0

Step 7: Check for Missing Values

Code
# Count missing values in each column
missing_counts = df_raw.isna().sum()

print("❓ Missing values per column:")
print(missing_counts)
print(f"\n📊 Total missing values: {missing_counts.sum()}")
❓ Missing values per column:
Region               1
Region Name          1
Region - RegionId    3
Variable             3
Variable Name        3
Age                  3
Age Name             3
Scale                3
Units                3
2008                 3
dtype: int64

📊 Total missing values: 26

Part 4: Data Cleaning with Method Chaining

Now we transform our raw data into analysis-ready format. Method chaining lets us do this elegantly.

Step 8: Clean Column Names

Pyjanitor’s clean_names() standardizes all column names to snake_case. Here we get the same lowercase-with-underscores effect with a plain .rename() lambda, keeping clean_names() as a commented-out alternative.

Code
# Clean column names using method chaining
# The dot (.) connects operations into a pipeline
df = (
    df_raw
    # .clean_names()  # Converts to snake_case, removes special characters
    .rename(columns=lambda c: str(c).lower().replace(" ", "_"))
)

print("✅ Column names cleaned!")
print("📝 New column names:")
print(df.columns.tolist())
✅ Column names cleaned!
📝 New column names:
['region', 'region_name', 'region_-_regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units', '2008']
Tip: Why Method Chaining?

Compare these approaches:

Without chaining (harder to read):

df = df_raw.copy()
df = df.rename(columns=lambda c: str(c).lower().replace(" ", "_"))
df = df.drop_duplicates()
df = df.reset_index(drop=True)

With chaining (clean and readable):

df = (
    df_raw
    .rename(columns=lambda c: str(c).lower().replace(" ", "_"))
    .drop_duplicates()
    .reset_index(drop=True)
)

The chained version reads like a recipe—each step flows naturally to the next.

Step 9: Identify String Columns

Before cleaning text values, we need to know which columns contain text.

Code
# Find columns holding text
# Pandas 3.x reports these as 'str'; on older versions use include=['object']
string_columns = df.select_dtypes(include=['str']).columns.tolist()

print(f"📝 String columns to clean: {string_columns}")
📝 String columns to clean: ['region', 'region_name', 'region_-_regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units']

Step 10: Clean String Values

Text data often has inconsistencies: extra spaces, mixed capitalization. Let’s fix that!

Code
# Clean all string columns using method chaining
# - .str.strip() removes leading/trailing whitespace
# - .str.title() capitalizes first letter of each word

df_cleaned = (
    df
    .assign(**{
        col: df[col].str.strip().str.title() 
        for col in string_columns
    })
)

print("✅ String values cleaned!")
df_cleaned.head()
✅ String values cleaned!
region region_name region_-_regionid variable variable_name age age_name scale units 2008
0 Kn.A2 Upper Nile Ss-Nu Kn.B2 Population, Total (Number) Kn.C1 Total Units Persons 964353.0
1 Kn.A2 Upper Nile Ss-Nu Kn.B2 Population, Total (Number) Kn.C2 0 To 4 Units Persons 150872.0
2 Kn.A2 Upper Nile Ss-Nu Kn.B2 Population, Total (Number) Kn.C3 5 To 9 Units Persons 151467.0
3 Kn.A2 Upper Nile Ss-Nu Kn.B2 Population, Total (Number) Kn.C4 10 To 14 Units Persons 126140.0
4 Kn.A2 Upper Nile Ss-Nu Kn.B2 Population, Total (Number) Kn.C5 15 To 19 Units Persons 103804.0
Note: What’s Happening in .assign()?

The assign() method creates or modifies columns, returning a new DataFrame. The **{...} syntax unpacks a dictionary, letting us apply the same operation to multiple columns at once.

This is roughly equivalent to writing, column by column:

df['region_name'] = df['region_name'].str.strip().str.title()
df['units'] = df['units'].str.strip().str.title()
# ... and so on
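Here is a runnable toy version of the same pattern (hypothetical column names, not the census data): one dictionary comprehension inside .assign() cleans every listed column at once.

```python
import pandas as pd

# Toy frame with two messy text columns (illustrative names)
df_demo = pd.DataFrame({
    "state": ["  juba ", "WAU"],
    "gender": [" male", "FEMALE "],
})

text_cols = ["state", "gender"]

# **{...} unpacks {column: cleaned_series} pairs into .assign()
cleaned = df_demo.assign(**{
    col: df_demo[col].str.strip().str.title()
    for col in text_cols
})

print(cleaned["state"].tolist())   # ['Juba', 'Wau']
print(cleaned["gender"].tolist())  # ['Male', 'Female']
```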

Step 11a: Ensure Correct Data Types

Sometimes numeric columns get read as text. Let’s ensure population is numeric.

Code
# Convert the 2008 count column to a whole-number type
# .round() clears float noise; the nullable 'Int64' dtype keeps missing rows as <NA>
df_cleaned = (
    df_cleaned
    .assign(population=lambda d: d["2008"].round().astype("Int64"))
)

print(f"✅ Population column type: {df_cleaned['population'].dtype}")
print(f"❓ Missing values after conversion: {df_cleaned['population'].isna().sum()}")
✅ Population column type: Int64
❓ Missing values after conversion: 3
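If the column had arrived as text rather than float, a common defensive pattern (not needed for this dataset, but worth knowing) is pd.to_numeric with errors='coerce', which turns unparseable values into missing values instead of raising an error:

```python
import pandas as pd

# Hypothetical messy input: mostly numbers, one junk value
raw = pd.Series(["964353", "150872", "n/a"])

# errors='coerce' maps "n/a" to NaN; nullable Int64 then stores it as <NA>
numeric = pd.to_numeric(raw, errors="coerce").astype("Int64")

print(numeric.isna().sum())  # 1
```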

Step 11b: Use the New pandas pd.col() Method (Pandas 3.0+)

Code
# Same conversion, written with the pd.col expression (new in pandas 3.0)
# pd.col("2008") references the column without a lambda
df_pd_col = (
    df_raw.assign(
        population=pd.col("2008").round().astype("Int64")
    )
)

print(f"✅ Population column type: {df_pd_col['population'].dtype}")
print(f"❓ Missing values after conversion: {df_pd_col['population'].isna().sum()}")
✅ Population column type: Int64
❓ Missing values after conversion: 3

Step 12: Drop Missing Values and Remove Aggregated Rows

Before analysis, we need to:

  1. Remove rows with missing population values — incomplete data skews results
  2. Filter out “Total” rows — these are pre-calculated aggregates that cause double-counting
Code
# Check current shape before cleaning
print(f"📊 Before cleaning: {df_cleaned.shape[0]:,} rows")
print(f"❓ Rows with missing population: {df_cleaned['population'].isna().sum()}")
📊 Before cleaning: 453 rows
❓ Rows with missing population: 3
Code
# Preview what's in variable_name to understand the data
print("🔍 Unique values in variable_name:")
print(df_cleaned['variable_name'].unique())
🔍 Unique values in variable_name:
<ArrowStringArray>
[ 'Population, Total (Number)',   'Population, Male (Number)',
 'Population, Female (Number)',                           nan]
Length: 4, dtype: str

Method 1: Using .dropna() and .query() (Recommended)

Code
# Drop NAs and filter out Total rows using method chaining
df_cleaned = (
    df_cleaned
    .dropna(subset=['population'])  # Remove rows where population is NaN
    .query("~variable_name.str.contains('Total', case=False)")  # Remove Total rows
    .reset_index(drop=True)
)

print(f"✅ After cleaning: {df_cleaned.shape[0]:,} rows")
✅ After cleaning: 300 rows

Method 2: Using .dropna() and .loc[] (Alternative)

Code
# Equivalent approach using .loc[] with a lambda (result kept separate for comparison)
dropna_df = (
    df_cleaned
    .dropna(subset=['population'])
    .loc[lambda x: ~x['variable_name'].str.contains('Total', case=False)]
    .reset_index(drop=True)
)
Tip: Why Remove “Total” Rows?

Census data often includes both detail rows (Male, Female) and aggregated Total rows.

If we keep Totals when calculating our own summaries:

Male: 100 + Female: 100 + Total: 200 = 400 ❌ (should be 200)

By removing Total rows first, we can calculate accurate statistics from scratch.
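A three-row toy frame makes the problem concrete: summing with the Total row double-counts, while filtering it out first gives the right answer.

```python
import pandas as pd

# Toy frame: detail rows plus a pre-aggregated Total row
mini = pd.DataFrame({
    "variable_name": ["Male", "Female", "Total"],
    "population": [100, 100, 200],
})

wrong = mini["population"].sum()                                    # 400 (double-counted)
right = mini.query("variable_name != 'Total'")["population"].sum()  # 200 (correct)

print(wrong, right)
```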

Code
# Verify only Male/Female remain
print("📝 Remaining values in variable_name:")
print(df_cleaned['variable_name'].unique())
📝 Remaining values in variable_name:
<ArrowStringArray>
['Population, Male (Number)', 'Population, Female (Number)']
Length: 2, dtype: str
Note: Variable Progression in This Tutorial

As we clean our data, we use descriptive variable names to track each stage:

  • df_raw → Original data loaded from URL
  • df → Column names cleaned (snake_case)
  • df_cleaned → String values cleaned + data types fixed
  • census_df → Final dataset with the extracted gender column

Notice how method chaining takes us directly from df_cleaned to census_df without intermediate copies—this is the power of pandas method chaining!

Step 13: String Splitting - Extracting Gender Information

Now let’s explore the variable_name column and extract meaningful information from it. This column contains structured text like “Population, Male (Number)” that we can split and parse.

Code
# Examine what's in the variable_name column
print("🔍 Unique values in variable_name column:")
print(df_cleaned['variable_name'].unique())
🔍 Unique values in variable_name column:
<ArrowStringArray>
['Population, Male (Number)', 'Population, Female (Number)']
Length: 2, dtype: str
Tip: String Splitting Strategy

The variable_name column follows a pattern: “Population, Gender (Number)”

We want to extract the middle piece (Male or Female) using various pandas string methods. Each approach has different use cases!

Method 1: Using .str.split() with Expand

The simplest approach: split on delimiter and select the piece you want.

Code
# Split on spaces and take the second piece (index 1)
df_method_1a = (
    df_cleaned
    .assign(
        # Split creates columns; we select the second one (index 1)
        gender_extracted=lambda x: x['variable_name'].str.split(' ', expand=True).get(1)
    )
)

print("✅ Method 1: Basic split with expand")
print(df_method_1a[['variable_name', 'gender_extracted']].drop_duplicates())
✅ Method 1: Basic split with expand
                  variable_name gender_extracted
0     Population, Male (Number)             Male
15  Population, Female (Number)           Female
Code
# Split on spaces and take the second piece (index 1)
df_method_1b = (
    df_cleaned
    .assign(
        # Split creates columns; we select the second one (index 1)
        gender_extracted=lambda x: x['variable_name'].str.split(' ', expand=True)[1]
    )
)

print("✅ Method 1: Basic split with expand")
print(df_method_1b[['variable_name', 'gender_extracted']].drop_duplicates())
✅ Method 1: Basic split with expand
                  variable_name gender_extracted
0     Population, Male (Number)             Male
15  Population, Female (Number)           Female
Note: How .str.split() Works
  • split(' ') breaks the string at each space
  • expand=True returns the pieces as separate columns (a DataFrame)
  • .get(1) or [1] selects the second column (Python uses 0-based indexing)
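You can watch each stage on a two-element Series built from the same strings as this dataset:

```python
import pandas as pd

s = pd.Series(["Population, Male (Number)", "Population, Female (Number)"])

# expand=True returns a DataFrame with one column per piece
parts = s.str.split(" ", expand=True)

print(parts.shape)        # (2, 3): columns 0, 1, 2
print(parts[1].tolist())  # ['Male', 'Female']
```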

Method 2: Chain Multiple Splits

Sometimes you need to clean up the extracted text further.

Code
# Split twice: first on spaces, then defensively strip any parenthetical part
df_method_2 = (
    df_cleaned
    .assign(
        # First split to get middle section
        temp_extract=lambda x: x['variable_name'].str.split(' ', expand=True)[1],
        # Then split on ' (' to remove the (Number) part
        gender_clean=lambda x: x['temp_extract'].str.split(' \\(', expand=True)[0]
    )
    .drop(columns=['temp_extract'])  # Remove temporary column
)

print("✅ Method 2: Chained splits for clean extraction")
print(df_method_2[['variable_name', 'gender_clean']].drop_duplicates())
✅ Method 2: Chained splits for clean extraction
                  variable_name gender_clean
0     Population, Male (Number)         Male
15  Population, Female (Number)       Female

Method 3: Using .str.extract() with Regex

For more complex patterns, regular expressions provide powerful extraction.

Code
# Use a regex to capture the word between a space and ' ('
df_method_3 = (
    df_cleaned
    .assign(
        gender_regex=lambda x: x['variable_name'].str.extract(r' (\w+) \(', expand=False)
    )
)

print("✅ Method 3: Regex extraction")
print(df_method_3[['variable_name', 'gender_regex']].drop_duplicates())
✅ Method 3: Regex extraction
                  variable_name gender_regex
0     Population, Male (Number)         Male
15  Population, Female (Number)       Female
Note: Understanding the Regex Pattern

r' (\w+) \(' breaks down as:

  • r'...' = raw string (backslashes are passed through to the regex engine)
  • ' ' = matches a literal space
  • (\w+) = captures one or more word characters (letters/digits)
  • \( = matches a literal '(' (the backslash escapes the parenthesis)

The parentheses () create a “capture group” that .extract() returns.
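A quick standalone check of the pattern:

```python
import pandas as pd

s = pd.Series(["Population, Male (Number)", "Population, Female (Number)"])

# With a single capture group, expand=False returns a Series
gender = s.str.extract(r" (\w+) \(", expand=False)

print(gender.tolist())  # ['Male', 'Female']
```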

Method 4: Using .str.replace() and String Slicing

Remove unwanted parts and extract what remains.

Code
# Remove the prefix and suffix to isolate the gender
df_method_4 = (
    df_cleaned
    .assign(
        gender_replace=lambda x: (
            x['variable_name']
            .str.replace('Population, ', '', regex=False)  # Remove prefix
            .str.replace(r' \(Number\)', '', regex=True)     # Remove suffix
        )
    )
)

print("✅ Method 4: String replacement")
print(df_method_4[['variable_name', 'gender_replace']].drop_duplicates())
✅ Method 4: String replacement
                  variable_name gender_replace
0     Population, Male (Number)           Male
15  Population, Female (Number)         Female

Method 5: One-liner with Method Chaining

Combine everything into a compact, readable pipeline. This creates our final cleaned dataframe: census_df.

Code
# The most elegant solution: all-in-one chained extraction
# This is our final cleaned dataset with extracted gender information
census_df = (
    df_cleaned
    .assign(
        gender=lambda x: (
            x['variable_name']
            .str.split(' ', expand=True)[1]     # Extract middle piece
            .str.split(r' \(', expand=True)[0]  # Defensive: drop any parenthetical
            .str.strip()                        # Remove any extra whitespace
        )
    )
)

print("✅ Method 5: Complete chained solution")
print(census_df[['variable_name', 'gender']].drop_duplicates())

# Verify the extraction worked
print("\n🎯 Unique gender values extracted:")
print(census_df['gender'].unique())
✅ Method 5: Complete chained solution
                  variable_name  gender
0     Population, Male (Number)    Male
15  Population, Female (Number)  Female

🎯 Unique gender values extracted:
<ArrowStringArray>
['Male', 'Female']
Length: 2, dtype: str
Tip: Which Method Should You Use?

Choose based on your needs:

Method Best For Pros Cons
Method 1 Simple, consistent patterns Fast, readable Requires fixed structure
Method 2 Multi-step cleaning Very clear logic More verbose
Method 3 Complex patterns Most flexible Requires regex knowledge
Method 4 Known prefix/suffix Intuitive Less robust to variations
Method 5 Production code Compact, professional Harder for beginners

For this tutorial, Method 5 is ideal: it’s clean, readable, and professional.

Step 14: Verify the Extraction

Let’s confirm our extracted gender values match what we expect.

Code
# Check the distribution of gender values
gender_counts = (
    census_df
    .groupby('gender', as_index=False)
    .agg(
        count=('gender', 'size'),
        total_population=('population', 'sum')
    )
    .sort_values('total_population', ascending=False)
)

print("📊 Gender distribution in extracted data:")
print(gender_counts)
📊 Gender distribution in extracted data:
   gender  count  total_population
1    Male    150           8574600
0  Female    150           7946380

Step 15: Prepare Data for Analysis

From this point on we work directly with census_df, the cleaned dataframe with the extracted gender column:

Code
# Use the cleaned dataframe with extracted gender column for all subsequent analyses
print("✅ Ready for analysis!")
print(f"📋 Final dataset: {census_df.shape[0]:,} rows × {census_df.shape[1]} columns")
✅ Ready for analysis!
📋 Final dataset: 300 rows × 12 columns

Step 16: Final Cleaned Dataset Summary

Code
# Summary of our cleaned data with extracted gender column
print("=" * 50)
print("🎉 DATA CLEANING COMPLETE!")
print("=" * 50)
print(f"📋 Final shape: {census_df.shape[0]:,} rows × {census_df.shape[1]} columns")
print(f"📝 Columns: {census_df.columns.tolist()}")
print("\n🔍 Preview:")
census_df.head()
==================================================
🎉 DATA CLEANING COMPLETE!
==================================================
📋 Final shape: 300 rows × 12 columns
📝 Columns: ['region', 'region_name', 'region_-_regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units', '2008', 'population', 'gender']

🔍 Preview:
region region_name region_-_regionid variable variable_name age age_name scale units 2008 population gender
0 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C1 Total Units Persons 525430.0 525430 Male
1 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C2 0 To 4 Units Persons 82690.0 82690 Male
2 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C3 5 To 9 Units Persons 83744.0 83744 Male
3 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C4 10 To 14 Units Persons 71027.0 71027 Male
4 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C5 15 To 19 Units Persons 57387.0 57387 Male

Part 5: Filtering Data — Two Approaches

Before diving into analysis, let’s learn how to filter (subset) data. Pandas offers two main approaches: .loc[] and .query(). Both are powerful—choose based on readability and context.

Step 17: Filtering with .loc[]

The .loc[] accessor filters rows using boolean conditions. It’s explicit and works well for simple filters.

Code
# Filter rows where gender is 'Male'
# .loc[condition] returns rows where condition is True

males_only = census_df.loc[census_df['gender'] == 'Male']

print(f"📊 Total rows: {len(census_df):,}")
print(f"👨 Male rows: {len(males_only):,}")
males_only.head()
📊 Total rows: 300
👨 Male rows: 150
region region_name region_-_regionid variable variable_name age age_name scale units 2008 population gender
0 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C1 Total Units Persons 525430.0 525430 Male
1 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C2 0 To 4 Units Persons 82690.0 82690 Male
2 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C3 5 To 9 Units Persons 83744.0 83744 Male
3 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C4 10 To 14 Units Persons 71027.0 71027 Male
4 Kn.A2 Upper Nile Ss-Nu Kn.B5 Population, Male (Number) Kn.C5 15 To 19 Units Persons 57387.0 57387 Male
Code
# Rename columns, drop the aggregated 'Total' age rows, and keep columns of interest
census_df = (
    census_df
    .rename(columns={
            'region_name': 'state',
            'age_name': 'age_category'
        }
    )
    .query("age_category != 'Total'")
    # select columns of interest
    [['state', 'gender', 'age_category', 'population']]
)

# Inspect output
print(census_df)
                 state  gender age_category  population
1           Upper Nile    Male       0 To 4       82690
2           Upper Nile    Male       5 To 9       83744
3           Upper Nile    Male     10 To 14       71027
4           Upper Nile    Male     15 To 19       57387
5           Upper Nile    Male     20 To 24       42521
..                 ...     ...          ...         ...
295  Eastern Equatoria  Female     45 To 49       13727
296  Eastern Equatoria  Female     50 To 54        9482
297  Eastern Equatoria  Female     55 To 59        5740
298  Eastern Equatoria  Female     60 To 64        5274
299  Eastern Equatoria  Female          65+        8637

[280 rows x 4 columns]
Code
# Multiple conditions with .loc[]
# Use & (and), | (or), ~ (not)
# IMPORTANT: Wrap each condition in parentheses!

# Filter: Males in 'Jonglei' state
males_in_jonglei = census_df.loc[
    ((census_df['gender'] == 'Male') & (census_df['state'] == 'Jonglei'))
]

print(f"👨 Males in Jonglei: {len(males_in_jonglei):,} rows")
males_in_jonglei.head()
👨 Males in Jonglei: 14 rows
state gender age_category population
31 Jonglei Male 0 To 4 115806
32 Jonglei Male 5 To 9 120885
33 Jonglei Male 10 To 14 101752
34 Jonglei Male 15 To 19 80739
35 Jonglei Male 20 To 24 62047
Note: .loc[] Syntax Rules
Operation Syntax
Equals df['col'] == 'value'
Not equals df['col'] != 'value'
Greater than df['col'] > 100
Multiple AND (cond1) & (cond2)
Multiple OR (cond1) | (cond2)
NOT ~(condition)
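These operators compose on any boolean Series; a small toy sketch:

```python
import pandas as pd

people = pd.DataFrame({
    "state": ["Jonglei", "Jonglei", "Unity"],
    "gender": ["Male", "Female", "Male"],
    "population": [120, 110, 90],
})

# AND: each condition wrapped in its own parentheses
both = people.loc[(people["state"] == "Jonglei") & (people["gender"] == "Male")]

# NOT: ~ inverts the boolean mask
not_jonglei = people.loc[~(people["state"] == "Jonglei")]

print(len(both), len(not_jonglei))  # 1 1
```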

Step 18: Filtering with .query()

The .query() method uses a string expression—often more readable, especially for complex filters.

Code
# Same filter using .query() - cleaner syntax!
males_only_query = census_df.query("gender == 'Male'")

print(f"👨 Male rows (via query): {len(males_only_query):,}")
males_only_query.head()
👨 Male rows (via query): 140
state gender age_category population
1 Upper Nile Male 0 To 4 82690
2 Upper Nile Male 5 To 9 83744
3 Upper Nile Male 10 To 14 71027
4 Upper Nile Male 15 To 19 57387
5 Upper Nile Male 20 To 24 42521
Code
# Multiple conditions with .query() - much cleaner!
# No parentheses needed, use 'and', 'or', 'not'

males_in_jonglei_query = census_df.query("gender == 'Male' and state == 'Jonglei'")

print(f"👨 Males in Jonglei (via query): {len(males_in_jonglei_query):,} rows")
males_in_jonglei_query.head()
👨 Males in Jonglei (via query): 14 rows
state gender age_category population
31 Jonglei Male 0 To 4 115806
32 Jonglei Male 5 To 9 120885
33 Jonglei Male 10 To 14 101752
34 Jonglei Male 15 To 19 80739
35 Jonglei Male 20 To 24 62047

Step 19: Using Variables in .query() with @

A powerful feature: reference Python variables inside query strings using @.

Code
# Define filter criteria as variables
target_state = 'Upper Nile'
min_population = 1000

# Use @ to reference variables inside the query string
filtered_df = census_df.query("state == @target_state and population >= @min_population")

print(f"🎯 Rows matching criteria: {len(filtered_df):,}")
filtered_df.head()
🎯 Rows matching criteria: 28
state gender age_category population
1 Upper Nile Male 0 To 4 82690
2 Upper Nile Male 5 To 9 83744
3 Upper Nile Male 10 To 14 71027
4 Upper Nile Male 15 To 19 57387
5 Upper Nile Male 20 To 24 42521
Code
# Filter using a list of values with 'in'
states_of_interest = ['Jonglei', 'Upper Nile', 'Unity']

# .loc approach - use .isin()
result_loc = census_df.loc[census_df['state'].isin(states_of_interest)]

# .query approach - use 'in @list'
result_query = census_df.query("state in @states_of_interest")

print(f"📊 Results match: {len(result_loc) == len(result_query)}")
print(f"📋 Rows found: {len(result_query):,}")
📊 Results match: True
📋 Rows found: 84
Tip: When to Use Each Method
Use .loc[] When… Use .query() When…
Simple, single conditions Complex, multi-condition filters
You need to select specific columns too Readability is priority
Working with column names that have spaces Filter logic reads like English
Programmatic/dynamic column selection Using variables with @

In method chains, .query() is often cleaner:

# Clean and readable
result = (
    df
    .query("state == 'Jonglei' and gender == 'Male'")
    .groupby('age_category')
    .agg(population=('population', 'sum'))
)

Step 20: Combining .loc[] with Column Selection

A unique advantage of .loc[]: select rows AND columns simultaneously.

Code
# Select specific rows AND columns with .loc[rows, columns]
subset = census_df.loc[
    census_df['state'] == 'Jonglei',           # Row condition
    ['state', 'gender', 'population']    # Columns to keep
]

print("📊 Filtered rows with selected columns:")
subset.head()
📊 Filtered rows with selected columns:
state gender population
31 Jonglei Male 115806
32 Jonglei Male 120885
33 Jonglei Male 101752
34 Jonglei Male 80739
35 Jonglei Male 62047
Code
# Equivalent with .query() + column selection
subset_query = (
    census_df
    .query("state == 'Jonglei'")
    [['state', 'gender', 'population']]  # Column selection after query
)

subset_query.head()
state gender population
31 Jonglei Male 115806
32 Jonglei Male 120885
33 Jonglei Male 101752
34 Jonglei Male 80739
35 Jonglei Male 62047

Part 6: Summary Statistics

Now that our data is clean, let’s calculate some high-level metrics.

Step 21: Calculate Overview Metrics

Code
# Build a summary table of key statistics
summary_stats = pd.DataFrame({
    'metric': [
        'Total Population',
        'Number of States', 
        'Age Categories',
        'Gender Groups'
    ],
    'value': [
        census_df['population'].sum(),
        census_df['state'].nunique(),
        census_df['age_category'].nunique(),
        census_df['gender'].nunique()
    ]
})

summary_stats
metric value
0 Total Population 8260490
1 Number of States 10
2 Age Categories 14
3 Gender Groups 2

Step 22: Display as Formatted Table

Let’s make this look professional with Great Tables.

Code
(
    GT(summary_stats)
    .tab_header(
        title="South Sudan 2008 Census Overview",
        subtitle="Key Summary Statistics"
    )
    .fmt_number(
        columns="value",
        rows=[0],  # Only format the first row (population)
        decimals=0,
        use_seps=True
    )
    .cols_label(
        metric="Metric",
        value="Value"
    )
    .cols_align(align="center")
)
South Sudan 2008 Census Overview
Key Summary Statistics
Metric Value
Total Population 8,260,490
Number of States 10
Age Categories 14
Gender Groups 2

Step 23: Age Category Cleaning

Method 1: Using .replace() with a mapping dictionary

Code
# Define the age category mapping (matching actual data format: '0 To 4')
age_mapping = {
    '0 To 4': '0-14',
    '5 To 9': '0-14',
    '10 To 14': '0-14',
    '15 To 19': '15-24',
    '20 To 24': '15-24',
    '25 To 29': '25-34',
    '30 To 34': '25-34',
    '35 To 39': '35-44',
    '40 To 44': '35-44',
    '45 To 49': '45-54',
    '50 To 54': '45-54',
    '55 To 59': '55-64',
    '60 To 64': '55-64',
    '65+': '65+'  # Already in correct format
}

# Apply the mapping using .replace()
census_new_age_categories = (
    census_df
    .assign(
        age_category=lambda x: x['age_category'].replace(age_mapping)
    )
)

print("✅ Age categories standardized!")
print("\n🎯 New age categories:")
print(sorted(census_new_age_categories['age_category'].unique()))
✅ Age categories standardized!

🎯 New age categories:
['0-14', '15-24', '25-34', '35-44', '45-54', '55-64', '65+']
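One caveat worth knowing (a toy snippet, not the census data): .replace() leaves values missing from the dict untouched, while .map() turns them into NaN, so .replace() is the safer choice when the mapping may not cover every value.

```python
import pandas as pd

ages = pd.Series(["0 To 4", "65+", "unknown"])
mapping = {"0 To 4": "0-14", "65+": "65+"}

# .replace() keeps unmapped values as-is...
print(ages.replace(mapping).tolist())  # ['0-14', '65+', 'unknown']

# ...while .map() turns them into NaN
print(ages.map(mapping).tolist())      # ['0-14', '65+', nan]
```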

Method 2: Using case_when() (Pandas 2.2+)

Code
# Using case_when for conditional transformations
# case_when is called on a Series instance, not as a class method
census_clean = (
    census_df
    .assign(
        age_category=lambda x: x['age_category'].case_when(
            caselist=[
                (x['age_category'].isin(['0 To 4', '5 To 9', '10 To 14']), '0-14'),
                (x['age_category'].isin(['15 To 19', '20 To 24']), '15-24'),
                (x['age_category'].isin(['25 To 29', '30 To 34']), '25-34'),
                (x['age_category'].isin(['35 To 39', '40 To 44']), '35-44'),
                (x['age_category'].isin(['45 To 49', '50 To 54']), '45-54'),
                (x['age_category'].isin(['55 To 59', '60 To 64']), '55-64'),
                (x['age_category'] == '65+', '65+'),
            ]
        )
    )
)

print("✅ Age categories standardized using case_when!")
print("\n🎯 New age categories:")
print(sorted(census_clean['age_category'].unique()))
✅ Age categories standardized using case_when!

🎯 New age categories:
['0-14', '15-24', '25-34', '35-44', '45-54', '55-64', '65+']

Part 7: Gender Analysis

Let’s explore how the population breaks down by gender.

Step 24: Calculate National Gender Distribution

This demonstrates a complete method chaining pipeline:

Code
# Aggregate population by gender
# This pipeline: groups → sums → sorts → calculates percentage

gender_distribution = (
    census_clean
    .groupby('gender', as_index=False)           # Group rows by gender
    .agg(population=('population', 'sum'))        # Sum population in each group
    .sort_values('population', ascending=False)   # Sort largest first
    .assign(
        percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2)
    )
    .reset_index(drop=True)                       # Clean up the index
)

gender_distribution
gender population percentage
0 Male 4287300 51.9
1 Female 3973190 48.1
NoteBreaking Down the Pipeline
Step What It Does
.groupby('gender') Creates groups: one for Male, one for Female
.agg(population=('population', 'sum')) Sums the population column within each group
.sort_values(...) Reorders rows by population (descending)
.assign(percentage=...) Adds a new column with calculated percentages
.reset_index(drop=True) Creates a clean 0, 1, 2… index
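To see why chaining helps, here is the same pipeline written both ways on a toy DataFrame (made-up numbers, just for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'population': [40, 30, 20, 10],
})

# Without chaining: one intermediate mutation per step
grouped = toy.groupby('gender', as_index=False).agg(population=('population', 'sum'))
grouped = grouped.sort_values('population', ascending=False)
grouped['percentage'] = (grouped['population'] / grouped['population'].sum() * 100).round(2)
grouped = grouped.reset_index(drop=True)

# With chaining: the same steps, one readable pipeline
chained = (
    toy
    .groupby('gender', as_index=False)
    .agg(population=('population', 'sum'))
    .sort_values('population', ascending=False)
    .assign(percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2))
    .reset_index(drop=True)
)

print(chained['percentage'].tolist())  # [60.0, 40.0]
```

Both versions produce identical results; the chained form avoids naming (and accidentally reusing) intermediate variables.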

Step 23: Display Gender Table

Code
(
    GT(gender_distribution)
    .tab_header(
        title="National Gender Distribution",
        subtitle="Population and Percentages"
    )
    .fmt_number(columns="population", decimals=0, use_seps=True)
    .fmt_number(columns="percentage", decimals=2)
    .cols_label(
        gender="Gender",
        population="Population",
        percentage="Percentage (%)"
    )
)
National Gender Distribution
Population and Percentages
Gender Population Percentage (%)
Male 4,287,300 51.90
Female 3,973,190 48.10

Step 24: Visualize with Interactive Donut Chart

Interactive charts let viewers explore data themselves—hover, click, zoom!

Code
# Create a donut chart (pie chart with a hole)
fig_gender = px.pie(
    gender_distribution,
    values='population',
    names='gender',
    title='Gender Distribution Across South Sudan',
    color='gender',
    color_discrete_map={'Male': '#22d3ee', 'Female': '#FFD700'},
    hole=0.4  # Creates the donut hole
)

# Customize appearance
fig_gender.update_traces(
    textposition='inside',
    textinfo='percent+label',
    marker=dict(line=dict(color='#ffffff', width=2))
)

fig_gender.update_layout(
    title_font_size=18,
    title_font_color="#06b6d4",
    height=500,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=-0.15,
        xanchor="center",
        x=0.5
    )
)

fig_gender.show()

Part 8: State-Level Gender Analysis

Now let’s dig deeper—how does gender distribution vary by state?

Step 25: Aggregate by State and Gender

Code
# Group by both state AND gender
state_gender_long = (
    census_clean
    .groupby(['state', 'gender'], as_index=False)
    .agg(population=('population', 'sum'))
)

print("📊 Long format (each state-gender combo is one row):")
state_gender_long.head(10)
📊 Long format (each state-gender combo is one row):
state gender population
0 Central Equatoria Female 521835
1 Central Equatoria Male 581722
2 Eastern Equatoria Female 440974
3 Eastern Equatoria Male 465187
4 Jonglei Female 624275
5 Jonglei Male 734327
6 Lakes Female 329850
7 Lakes Male 365880
8 Northern Bahr El Ghazal Female 372608
9 Northern Bahr El Ghazal Male 348290

Step 26: Pivot to Wide Format

Sometimes we need data in “wide” format—one row per state, separate columns for each gender.

Code
# Pivot: rows become states, columns become genders
state_gender_wide = (
    state_gender_long
    .pivot(index='state', columns='gender', values='population')
    .reset_index()
    # .clean_names()  # Standardize the new column names
    .rename(columns=lambda c: c.lower())
)

print("📊 Wide format (one row per state):")
state_gender_wide.head()
📊 Wide format (one row per state):
gender state female male
0 Central Equatoria 521835 581722
1 Eastern Equatoria 440974 465187
2 Jonglei 624275 734327
3 Lakes 329850 365880
4 Northern Bahr El Ghazal 372608 348290

Step 27: Calculate Derived Metrics

Now we can calculate useful metrics like gender ratio.

Code
# Add calculated columns
state_gender = (
    state_gender_wide
    .assign(
        total=lambda x: x['male'] + x['female'],
        male_pct=lambda x: (x['male'] / x['total'] * 100).round(2),
        female_pct=lambda x: (x['female'] / x['total'] * 100).round(2),
        gender_ratio=lambda x: (x['male'] / x['female'] * 100).round(2)  # Males per 100 females
    )
    .sort_values('total', ascending=False)
    .reset_index(drop=True)
)

print("📊 State-level gender analysis:")
state_gender.head(10)
📊 State-level gender analysis:
gender state female male total male_pct female_pct gender_ratio
0 Jonglei 624275 734327 1358602 54.05 45.95 117.63
1 Central Equatoria 521835 581722 1103557 52.71 47.29 111.48
2 Warrap 502194 470734 972928 48.38 51.62 93.74
3 Upper Nile 438923 525430 964353 54.49 45.51 119.71
4 Eastern Equatoria 440974 465187 906161 51.34 48.66 105.49
5 Northern Bahr El Ghazal 372608 348290 720898 48.31 51.69 93.47
6 Lakes 329850 365880 695730 52.59 47.41 110.92
7 Western Equatoria 300586 318443 619029 51.44 48.56 105.94
8 Unity 285554 300247 585801 51.25 48.75 105.15
9 Western Bahr El Ghazal 156391 177040 333431 53.1 46.9 113.2
NoteUnderstanding Gender Ratio

A gender ratio of 100 means equal numbers of males and females.

- Above 100 = more males
- Below 100 = more females
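The arithmetic is simple; here it is spelled out with Jonglei's counts from the table above:

```python
# Gender ratio = males per 100 females (Jonglei, from the state table)
male, female = 734_327, 624_275
ratio = round(male / female * 100, 2)

print(ratio)  # 117.63 → roughly 118 males for every 100 females
```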

Step 28: Display Top 5 States Table

Code
top_5_table = state_gender.head(5)  # a DataFrame, not a list of state names

(
    GT(top_5_table)
    .tab_header(
        title="Top 5 States by Population",
        subtitle="Gender Breakdown and Ratios"
    )
    .fmt_number(columns=['male', 'female', 'total'], decimals=0, use_seps=True)
    .fmt_number(columns=['male_pct', 'female_pct', 'gender_ratio'], decimals=2)
    .cols_label(
        state="State",
        male="Male",
        female="Female",
        total="Total",
        male_pct="Male %",
        female_pct="Female %",
        gender_ratio="Ratio (M/100F)"
    )
    .data_color(
        columns="gender_ratio",
        palette=["#FFD700", "#ffffff", "#22d3ee"]
    )
)
Top 5 States by Population
Gender Breakdown and Ratios
State Female Male Total Male % Female % Ratio (M/100F)
Jonglei 624,275 734,327 1,358,602 54.05 45.95 117.63
Central Equatoria 521,835 581,722 1,103,557 52.71 47.29 111.48
Warrap 502,194 470,734 972,928 48.38 51.62 93.74
Upper Nile 438,923 525,430 964,353 54.49 45.51 119.71
Eastern Equatoria 440,974 465,187 906,161 51.34 48.66 105.49

Step 29: Visualize State Gender Distribution

Code
# Reshape data for plotting (back to long format)
state_gender_plot = (
    state_gender
    .head(5)
    .melt(
        id_vars=['state'],
        value_vars=['male', 'female'],
        var_name='gender',
        value_name='population'
    )
    .assign(gender=lambda x: x['gender'].str.title())  # Capitalize for display
)

# Create grouped bar chart
fig_state_gender = px.bar(
    state_gender_plot,
    x='state',
    y='population',
    color='gender',
    title='Population by State and Gender (Top 5 States)',
    labels={'population': 'Population', 'state': 'State'},
    color_discrete_map={'Male': '#22d3ee', 'Female': '#FFD700'},
    barmode='group'
)

fig_state_gender.update_layout(
    title_font_size=18,
    title_font_color="#06b6d4",
    xaxis_tickangle=-45,
    height=600,
    hovermode='x unified',
    legend=dict(
        title="Gender",
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig_state_gender.update_traces(
    hovertemplate='<b>%{x}</b><br>Population: %{y:,}<extra></extra>'
)

fig_state_gender.show()

Part 9: Age Category Analysis

Step 30: Calculate National Age Distribution

Code
# Aggregate by age category
age_distribution = (
    census_clean
    .groupby('age_category', as_index=False)
    .agg(population=('population', 'sum'))
    .sort_values('population', ascending=False)
    .assign(
        percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2)
    )
    .reset_index(drop=True)
)

age_distribution
age_category population percentage
0 0-14 3659337 44.3
1 15-24 1628835 19.72
2 25-34 1234926 14.95
3 35-44 815517 9.87
4 45-54 473365 5.73
5 55-64 237426 2.87
6 65+ 211084 2.56

Step 31: Display Age Distribution Table

Code
(
    GT(age_distribution)
    .tab_header(
        title="Population Distribution by Age Category",
        subtitle="National Summary"
    )
    .fmt_number(columns="population", decimals=0, use_seps=True)
    .fmt_number(columns="percentage", decimals=2)
    .cols_label(
        age_category="Age Category",
        population="Population",
        percentage="Percentage (%)"
    )
    .data_color(
        columns="percentage",
        palette=["#000000", "#0891b2", "#22d3ee", "#FFD700"]
    )
)
Population Distribution by Age Category
National Summary
Age Category Population Percentage (%)
0-14 3,659,337 44.30
15-24 1,628,835 19.72
25-34 1,234,926 14.95
35-44 815,517 9.87
45-54 473,365 5.73
55-64 237,426 2.87
65+ 211,084 2.56

Step 32: Visualize Age Distribution

Code
# Create horizontal bar chart (easier to read category names)
fig_age = px.bar(
    age_distribution.sort_values('population'),  # Sort for visual clarity
    x='population',
    y='age_category',
    orientation='h',
    title='Population Distribution by Age Category',
    labels={'population': 'Population', 'age_category': 'Age Category'},
    color='population',
    color_continuous_scale=['#000000', '#22d3ee', '#FFD700']
)

fig_age.update_layout(
    title_font_size=18,
    title_font_color="#06b6d4",
    height=500,
    showlegend=False
)

fig_age.update_traces(
    hovertemplate='<b>%{y}</b><br>Population: %{x:,}<extra></extra>'
)

fig_age.show()

Part 10: Age and Gender Interaction

The population pyramid is a classic demographic visualization.

Step 33: Prepare Age-Gender Cross-Tabulation

Code
# Aggregate by both age and gender
age_gender_long = (
    census_clean
    .groupby(['age_category', 'gender'], as_index=False)
    .agg(population=('population', 'sum'))
)

# Pivot and calculate metrics
age_gender = (
    age_gender_long
    .pivot(index='age_category', columns='gender', values='population')
    .reset_index()
    # .clean_names()
    .rename(columns=lambda c: c.lower())
    .assign(
        total=lambda x: x['male'] + x['female'],
        female_pct=lambda x: (x['female'] / x['total'] * 100).round(2)
    )
    .sort_values('total', ascending=False)
    .reset_index(drop=True)
)

age_gender
gender age_category female male total female_pct
0 0-14 1711988 1947349 3659337 46.78
1 15-24 805145 823690 1628835 49.43
2 25-34 640997 593929 1234926 51.91
3 35-44 403039 412478 815517 49.42
4 45-54 218612 254753 473365 46.18
5 55-64 105345 132081 237426 44.37
6 65+ 88064 123020 211084 41.72

Step 34: Create Population Pyramid

A population pyramid shows males on the left, females on the right—a powerful way to visualize demographic structure.

Code
# Prepare data: males get negative values (to extend left)
# .mask() flips the sign wherever gender == 'Male' in one vectorized
# step, avoiding a slow row-by-row apply()
pyramid_data = (
    age_gender_long
    .assign(
        population_adjusted=lambda x: x['population'].mask(
            x['gender'] == 'Male', -x['population']
        )
    )
)

# Separate by gender
male_pyramid = pyramid_data.query("gender == 'Male'")
female_pyramid = pyramid_data.query("gender == 'Female'")

# Build the pyramid
fig_pyramid = go.Figure()

# Male bars (left side)
fig_pyramid.add_trace(
    go.Bar(
        y=male_pyramid['age_category'],
        x=male_pyramid['population_adjusted'],
        name='Male',
        orientation='h',
        marker=dict(color='#22d3ee'),
        hovertemplate='<b>Male - %{y}</b><br>Population: %{customdata:,}<extra></extra>',
        customdata=male_pyramid['population']
    )
)

# Female bars (right side)
fig_pyramid.add_trace(
    go.Bar(
        y=female_pyramid['age_category'],
        x=female_pyramid['population'],
        name='Female',
        orientation='h',
        marker=dict(color='#FFD700'),
        hovertemplate='<b>Female - %{y}</b><br>Population: %{x:,}<extra></extra>'
    )
)

fig_pyramid.update_layout(
    title='Population Pyramid: Age and Gender Distribution',
    title_font_size=18,
    title_font_color="#06b6d4",
    xaxis=dict(
        title='Population',
        # Ticks must span the data: the 0-14 male bar reaches about -1.95M
        tickvals=[-2_000_000, -1_500_000, -1_000_000, -500_000, 0,
                  500_000, 1_000_000, 1_500_000, 2_000_000],
        ticktext=['2M', '1.5M', '1M', '0.5M', '0', '0.5M', '1M', '1.5M', '2M']
    ),
    yaxis=dict(title='Age Category'),
    barmode='overlay',
    height=600,
    bargap=0.1,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig_pyramid.show()

Part 11: State Rankings

Step 35: Create State Population Rankings

Code
# Calculate comprehensive state rankings
state_ranking = (
    census_clean
    .groupby('state', as_index=False)
    .agg(population=('population', 'sum'))
    .sort_values('population', ascending=False)
    .assign(
        rank=lambda x: range(1, len(x) + 1),
        percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2),
        cumulative_pct=lambda x: x['percentage'].cumsum().round(2)
    )
    .reset_index(drop=True)
)

state_ranking.head(5)
state population rank percentage cumulative_pct
0 Jonglei 1358602 1 16.45 16.45
1 Central Equatoria 1103557 2 13.36 29.81
2 Warrap 972928 3 11.78 41.59
3 Upper Nile 964353 4 11.67 53.26
4 Eastern Equatoria 906161 5 10.97 64.23

Step 36: Display Top 5 States Table

Code
top_5 = state_ranking.head(5)

(
    GT(top_5)
    .tab_header(
        title="Top 5 Most Populous States",
        subtitle="Population, Share, and Cumulative Distribution"
    )
    .fmt_number(columns="population", decimals=0, use_seps=True)
    .fmt_number(columns=["percentage", "cumulative_pct"], decimals=2)
    .cols_label(
        rank="Rank",
        state="State",
        population="Population",
        percentage="% of Total",
        cumulative_pct="Cumulative %"
    )
    .data_color(
        columns="population",
        palette=["#000000", "#0891b2", "#22d3ee", "#FFD700"]
    )
)
Top 5 Most Populous States
Population, Share, and Cumulative Distribution
State Population Rank % of Total Cumulative %
Jonglei 1,358,602 1 16.45 16.45
Central Equatoria 1,103,557 2 13.36 29.81
Warrap 972,928 3 11.78 41.59
Upper Nile 964,353 4 11.67 53.26
Eastern Equatoria 906,161 5 10.97 64.23

Step 37: Visualize with Treemap

Treemaps show hierarchical data as nested rectangles—size represents magnitude.

Code
fig_treemap = px.treemap(
    top_5,
    path=['state'],
    values='population',
    title='State Population Distribution (Top 5)',
    color='population',
    color_continuous_scale=['#000000', '#0891b2', '#22d3ee', '#FFD700']
)

fig_treemap.update_traces(
    textposition="middle center",
    texttemplate="<b>%{label}</b><br>%{value:,}",
    hovertemplate='<b>%{label}</b><br>Population: %{value:,}<br>Share: %{percentRoot:.2%}<extra></extra>'
)

fig_treemap.update_layout(
    title_font_size=18,
    title_font_color="#06b6d4",
    height=600
)

fig_treemap.show()

Part 12: State-Level Age Distribution

Step 38: Analyze Age in Top 5 States

Code
# Get top 5 states by population
top_5_states = state_ranking.head(5)['state'].tolist()

print(f"📍 Top 5 states: {top_5_states}")
📍 Top 5 states: ['Jonglei', 'Central Equatoria', 'Warrap', 'Upper Nile', 'Eastern Equatoria']
Code
# Calculate age distribution within each top state
state_age_distribution = (
    census_clean
    .query("state in @top_5_states")  # Filter to top 5 only
    .groupby(['state', 'age_category'], as_index=False)
    .agg(population=('population', 'sum'))
    .assign(
        percentage=lambda x: (
            x.groupby('state')['population']
            .transform(lambda pop: pop / pop.sum() * 100)
            .round(2)
        )
    )
)

state_age_distribution.head(10)
state age_category population percentage
0 Central Equatoria 0-14 463463 42.0
1 Central Equatoria 15-24 240239 21.77
2 Central Equatoria 25-34 181599 16.46
3 Central Equatoria 35-44 110539 10.02
4 Central Equatoria 45-54 59749 5.41
5 Central Equatoria 55-64 27963 2.53
6 Central Equatoria 65+ 20005 1.81
7 Eastern Equatoria 0-14 400689 44.22
8 Eastern Equatoria 15-24 197580 21.8
9 Eastern Equatoria 25-34 130718 14.43
TipThe query() Method

query("state in @top_5_states") is a readable way to filter data. The @ symbol references a Python variable inside the query string.

This is equivalent to:

df[df['state'].isin(top_5_states)]
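Both approaches return identical rows, which is easy to verify on a toy frame (hypothetical states, not the census data):

```python
import pandas as pd

df = pd.DataFrame({'state': ['A', 'B', 'C', 'A'],
                   'population': [1, 2, 3, 4]})
top_states = ['A', 'C']

via_query = df.query("state in @top_states")   # @ pulls in the Python list
via_loc = df[df['state'].isin(top_states)]     # explicit boolean mask

print(via_query.equals(via_loc))  # True
```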

Step 39: Visualize State Age Distribution

Code
fig_state_age = px.bar(
    state_age_distribution,
    x='state',
    y='population',
    color='age_category',
    title='Age Distribution in Top 5 States',
    labels={
        'population': 'Population', 
        'state': 'State', 
        'age_category': 'Age Category'
    },
    text='percentage',
    color_discrete_sequence=['#000000', '#0891b2', '#22d3ee', '#06b6d4', '#FFD700', '#f59e0b']
)

fig_state_age.update_traces(
    texttemplate='%{text:.1f}%',
    textposition='inside',
    hovertemplate='<b>%{x}</b><br>%{fullData.name}<br>Population: %{y:,}<br>Percentage: %{text:.2f}%<extra></extra>'
)

fig_state_age.update_layout(
    title_font_size=18,
    title_font_color="#06b6d4",
    xaxis_tickangle=-45,
    height=600,
    barmode='stack',
    legend=dict(
        title="Age Category",
        orientation="v",
        yanchor="middle",
        y=0.5,
        xanchor="left",
        x=1.02
    )
)

fig_state_age.show()

Part 13: Comprehensive Dashboard

Let’s combine everything into a single, multi-panel dashboard.

Step 40: Build Multi-Panel Dashboard

Code
# Filter to top 5 states
top_5_df = census_clean.query("state in @top_5_states")

# Create 2x2 subplot grid
fig_dashboard = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        'Total Population by State',
        'Gender Distribution',
        'Age Distribution',
        'Gender Ratio (Males per 100 Females)'
    ),
    specs=[
        [{"type": "bar"}, {"type": "bar"}],
        [{"type": "bar"}, {"type": "bar"}]
    ],
    vertical_spacing=0.12,
    horizontal_spacing=0.1
)

# ----- Panel 1: Total Population -----
panel1_data = (
    top_5_df
    .groupby('state', as_index=False)
    .agg(population=('population', 'sum'))
    .sort_values('population', ascending=False)
)

fig_dashboard.add_trace(
    go.Bar(
        x=panel1_data['state'], 
        y=panel1_data['population'], 
        marker_color='#22d3ee', 
        name='Total Pop',
        showlegend=False
    ),
    row=1, col=1
)

# ----- Panel 2: Gender Distribution -----
panel2_data = (
    top_5_df
    .groupby(['state', 'gender'], as_index=False)
    .agg(population=('population', 'sum'))
)

for gender, color in [('Male', '#22d3ee'), ('Female', '#FFD700')]:
    gender_data = panel2_data.query("gender == @gender")
    fig_dashboard.add_trace(
        go.Bar(
            x=gender_data['state'], 
            y=gender_data['population'], 
            name=gender, 
            marker_color=color
        ),
        row=1, col=2
    )

# ----- Panel 3: Age Distribution (Grouped) -----
panel3_data = (
    top_5_df
    .groupby(['state', 'age_category'], as_index=False)
    .agg(population=('population', 'sum'))
)

age_categories = panel3_data['age_category'].unique()
colors = ['#000000', '#0891b2', '#22d3ee', '#06b6d4', '#FFD700']

for i, age_cat in enumerate(age_categories):  # all 7 categories; colors cycle via modulo
    age_data = panel3_data.query("age_category == @age_cat")
    fig_dashboard.add_trace(
        go.Bar(
            x=age_data['state'], 
            y=age_data['population'], 
            name=age_cat, 
            marker_color=colors[i % len(colors)],
            showlegend=False
        ),
        row=2, col=1
    )

# ----- Panel 4: Gender Ratio -----
panel4_data = (
    top_5_df
    .groupby(['state', 'gender'], as_index=False)
    .agg(population=('population', 'sum'))
    .pivot(index='state', columns='gender', values='population')
    .reset_index()
    # .clean_names()
    .rename(columns=lambda c: c.lower())
    .assign(
        ratio=lambda x: (x['male'] / x['female'] * 100).round(2)
    )
)

fig_dashboard.add_trace(
    go.Bar(
        x=panel4_data['state'], 
        y=panel4_data['ratio'], 
        marker_color='#FFD700', 
        name='Gender Ratio',
        showlegend=False
    ),
    row=2, col=2
)

# ----- Layout -----
fig_dashboard.update_layout(
    title_text="Comprehensive Demographic Dashboard: Top 5 States",
    title_font_size=20,
    title_font_color="#06b6d4",
    height=900,
    showlegend=True,
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=-0.15, 
        xanchor="center", 
        x=0.5
    )
)

fig_dashboard.update_xaxes(tickangle=-45)
fig_dashboard.update_yaxes(title_text="Population", row=1, col=1)
fig_dashboard.update_yaxes(title_text="Population", row=1, col=2)
fig_dashboard.update_yaxes(title_text="Population", row=2, col=1)
fig_dashboard.update_yaxes(title_text="Ratio", row=2, col=2)

fig_dashboard.show()

Key Insights

ImportantWhat the Data Tells Us

1. Population Concentration

The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.

2. Youth Demographics

The age distribution reveals a young population—typical of developing nations. This “youth bulge” represents both opportunity (workforce potential) and challenge (education, employment needs).

3. Gender Balance

Most states show relatively balanced gender distributions, with some variation that may reflect migration patterns, conflict impacts, or data collection methodology.

4. Regional Disparities

Substantial population differences between states suggest the need for differentiated development strategies rather than uniform national approaches.


Conclusion

Congratulations! You’ve completed a full demographic analysis using professional data science techniques.

What You Learned:

✅ Loading data from URLs with pd.read_csv()

✅ Exploring data structure with .head(), .dtypes, .isna()

✅ Cleaning data with method chaining and pyjanitor

✅ Filtering with .loc[] — explicit boolean conditions

✅ Filtering with .query() — readable string expressions

✅ Aggregating with .groupby() and .agg()

✅ Reshaping with .pivot() and .melt()

✅ Creating interactive visualizations with Plotly

✅ Building multi-panel dashboards

The techniques you’ve learned apply to ANY dataset—from business analytics to scientific research.

TipNext Steps
  1. Download the code and try it with your own data
  2. Experiment with different Plotly chart types
  3. Explore pandas’ full method chaining capabilities
  4. Practice! The best way to learn is by doing.

Technical Reference

Libraries Used

Library Version Purpose
pandas 3.0+ Data manipulation with method chaining
pyjanitor 0.26+ Clean column names and data cleaning helpers
Plotly 5.17+ Interactive, web-ready visualizations
Great Tables 0.8+ Professional table formatting

Key Techniques Demonstrated

Technique Description
Method Chaining Connect operations with . for readable pipelines
clean_names() Standardize column names to snake_case
Lambda Functions Inline calculations within assign()
.loc[] Filtering Explicit boolean conditions for row selection
.query() Filtering Readable string expressions with @ variable support
groupby().agg() Flexible aggregation with named outputs
pivot() / melt() Reshape between long and wide formats

Tutorial by Alierwai Reng, Founder of PyStatR+. For data science training or collaboration, reach out at info@pystatrplus.org


TipAbout the Author

Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator, and analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in “Education from the Heart.”

For training, consulting, or collaboration opportunities: 📧 info@pystatrplus.org 🌐 pystatrplus.org

NoteEditor’s Note

This tutorial reflects PyStatR+’s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.

At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.

PyStatR+: Learning Simplified. Communication Amplified.