---
title: "Interactive Demographic Explorer: South Sudan 2008 Census"
subtitle: "A Beginner's Guide to Data Analysis with Pandas Method Chaining and Plotly"
author: "Alierwai Reng"
date: "2024-11-20"
categories: [Python, Pandas, Interactive Visualization, Demographics, Census]
image: featured.png
description: "Learn data analysis step-by-step! This beginner-friendly tutorial explores South Sudan's demographics using pandas method chaining for clean, readable code and Plotly for interactive visualizations."
format:
  html:
    code-fold: true
    code-tools: true
    toc: true
    toc-depth: 3
    toc-title: "Tutorial Contents"
    self-contained: true
---
## Introduction
Welcome to this hands-on data analysis tutorial! By the end of this guide, you'll understand how to:
- **Load and explore** real-world census data
- **Clean data** using pandas method chaining (a professional, readable approach)
- **Calculate** summary statistics and group-level metrics
- **Create interactive visualizations** that invite exploration
We'll analyze South Sudan's 2008 census data, but the techniques you learn apply to any dataset.
::: {.callout-tip}
## What is Method Chaining?
Method chaining connects multiple operations in a single, flowing statement—like a recipe where each step feeds into the next. Instead of creating many intermediate variables, we write clean, readable pipelines.
:::
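To make this concrete, here is a minimal sketch with made-up data: the same clean-rename-dedupe steps written as one chained pipeline.

```python
import pandas as pd

# Toy data: a messy column name, untidy text, and a duplicated row
raw = pd.DataFrame({"Name ": [" juba ", " wau ", " juba "], "Pop": [300, 100, 300]})

tidy = (
    raw
    .rename(columns=lambda c: c.strip().lower())                # "Name " -> "name", "Pop" -> "pop"
    .drop_duplicates()                                          # drop the repeated juba row
    .assign(name=lambda d: d["name"].str.strip().str.title())   # " juba " -> "Juba"
    .reset_index(drop=True)
)
print(tidy)
```

Each step receives the output of the previous one, so no intermediate variables are needed.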
---
## Part 1: Environment Setup
### Step 1: Import Required Libraries
Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.
```{python}
#| label: load-packages
# Data manipulation
import pandas as pd # The core library for data analysis
import janitor # Adds helper functions like clean_names()
# Visualization
import plotly.express as px # Quick, beautiful charts
import plotly.graph_objects as go # Advanced chart customization
from plotly.subplots import make_subplots # Multi-panel dashboards
# Tables
from great_tables import GT # Professional table formatting
print("✅ All libraries loaded successfully!")
print(f"1) 📊 Pandas version: {pd.__version__};\n2) janitor version: {janitor.__version__}")
```
::: {.callout-note}
## Library Purposes
| Library | What It Does |
|---------|--------------|
| **pandas** | Load, clean, transform, and analyze data |
| **janitor** | Simplifies common cleaning tasks (like fixing column names) |
| **plotly** | Creates interactive, web-ready charts |
| **great_tables** | Makes publication-quality tables |
:::
### Step 2: Configure a Custom Theme
Consistent styling makes your visualizations look professional. We'll create a PyStatR+ branded theme.
```{python}
import plotly.io as pio
# Define custom color palette and fonts
pio.templates["pystatr"] = go.layout.Template(
layout=go.Layout(
font=dict(family="Poppins, sans-serif", size=12),
plot_bgcolor="#ffffff",
paper_bgcolor="#ffffff",
colorway=["#22d3ee", "#FFD700", "#06b6d4", "#0891b2", "#0e7490"],
title_font=dict(size=18, color="#06b6d4", family="Poppins"),
)
)
# Set as default for all charts
pio.templates.default = "pystatr"
print("🎨 Custom theme configured!")
```
---
## Part 2: Loading the Data
### Step 3: Load Census Data from URL
Real datasets often live on the web. Pandas can read directly from URLs—no download required!
```{python}
# Define the data source URL
# Storing the URL in a well-named variable makes it easy to find and update later
data_url = (
"https://raw.githubusercontent.com/tongakuot/r_tutorials/"
"main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"
)
# Load the data into a DataFrame
# We name it 'df_raw' to indicate this is the original, uncleaned data
df_raw = pd.read_csv(data_url)
# Confirm successful load with basic info
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
```
::: {.callout-tip}
## Naming Convention: snake_case
We use `snake_case` (lowercase words separated by underscores) for all variable names. This is the Python standard and makes code more readable.
- ✅ Good: `df_raw`, `gender_distribution`, `top_10_states`
- ❌ Avoid: `dfRaw`, `GenderDistribution`, `Top10States`
:::
---
## Part 3: Data Exploration
Before cleaning, we must **understand** our data. What columns exist? What types are they? Are there issues?
### Step 4: Examine Column Names
```{python}
# View the original column names
print("📝 Original column names:")
print(df_raw.columns.tolist())
```
Notice any issues? Inconsistent capitalization, spaces, or special characters make columns harder to work with.
### Step 5: Check Data Types
```{python}
# Check what type of data each column contains
print("🔍 Data types:")
print(df_raw.dtypes)
```
::: {.callout-note}
## Understanding Data Types
| Type | Meaning | Examples |
|------|---------|----------|
| `int64` | Whole numbers | 1, 42, 1000 |
| `float64` | Decimal numbers | 3.14, 99.9 |
| `object` | Text (strings) | "Male", "Juba" |
:::
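You can see these types in action with a tiny made-up frame:

```python
import pandas as pd

# One column of each common type
demo = pd.DataFrame({
    "count": [1, 42],          # whole numbers -> int64
    "rate": [3.14, 99.9],      # decimals -> float64
    "city": ["Juba", "Wau"],   # text -> object (plain Python strings)
})
print(demo.dtypes)
```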
### Step 6: Preview the Data
```{python}
# View the first 5 rows
# This gives us a concrete look at actual values
df_raw.head()
```
### Step 7: Check for Missing Values
```{python}
# Count missing values in each column
missing_counts = df_raw.isna().sum()
print("❓ Missing values per column:")
print(missing_counts)
print(f"\n📊 Total missing values: {missing_counts.sum()}")
```
---
## Part 4: Data Cleaning with Method Chaining
Now we transform our raw data into analysis-ready format. **Method chaining** lets us do this elegantly.
### Step 8: Clean Column Names
The `clean_names()` function from pyjanitor standardizes all column names to `snake_case`. Here we get the same result with a plain `rename()` and a lambda, keeping `clean_names()` as a commented-out alternative.
```{python}
# Clean column names using method chaining
# The dot (.) connects operations into a pipeline
df = (
df_raw
# .clean_names() # Converts to snake_case, removes special characters
.rename(columns=lambda c: str(c).lower().replace(" ", "_"))
)
print("✅ Column names cleaned!")
print("📝 New column names:")
print(df.columns.tolist())
```
::: {.callout-tip}
## Why Method Chaining?
Compare these approaches:
**Without chaining (harder to read):**
```python
df = df_raw.copy()
df = df.clean_names()
df = df.drop_duplicates()
df = df.reset_index(drop=True)
```
**With chaining (clean and readable):**
```python
df = (
df_raw
.rename(columns=lambda c: str(c).lower().replace(" ", "_"))
.drop_duplicates()
.reset_index(drop=True)
)
```
The chained version reads like a recipe—each step flows naturally to the next.
:::
### Step 9: Identify String Columns
Before cleaning text values, we need to know which columns contain text.
```{python}
# Find columns with string (object) data type
string_columns = df.select_dtypes(include=['object']).columns.tolist()
print(f"📝 String columns to clean: {string_columns}")
```
### Step 10: Clean String Values
Text data often has inconsistencies: extra spaces, mixed capitalization. Let's fix that!
```{python}
# Clean all string columns using method chaining
# - .str.strip() removes leading/trailing whitespace
# - .str.title() capitalizes first letter of each word
df_cleaned = (
df
.assign(**{
col: df[col].str.strip().str.title()
for col in string_columns
})
)
print("✅ String values cleaned!")
df_cleaned.head()
```
::: {.callout-note}
## What's Happening in `.assign()`?
The `assign()` method creates or modifies columns. The `**{...}` syntax unpacks a dictionary, letting us apply the same operation to multiple columns at once.
This is equivalent to:
```python
df['state'] = df['state'].str.strip().str.title()
df['gender'] = df['gender'].str.strip().str.title()
# ... and so on
```
:::
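Here is the same pattern as a runnable sketch on a two-row toy frame (the values are invented for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"state": ["  jonglei ", "UNITY"], "gender": [" male", "FEMALE "]})
text_cols = ["state", "gender"]

# The dict comprehension builds {column: cleaned Series};
# ** unpacks it so .assign() updates every listed column at once
cleaned = df_demo.assign(**{col: df_demo[col].str.strip().str.title() for col in text_cols})
print(cleaned)
```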
### Step 11a: Ensure Correct Data Types
Sometimes numeric columns get read as text. Let's ensure `population` is numeric.
```{python}
# Build a population column from the '2008' counts
# Rounding, then casting to the nullable 'Int64' type keeps any missing values as <NA>
df_cleaned = (
df_cleaned
.assign(population=lambda d: d["2008"].round().astype("Int64"))
)
print(f"✅ Population column type: {df_cleaned['population'].dtype}")
print(f"❓ Missing values after conversion: {df_cleaned['population'].isna().sum()}")
```
### Step 11b: Use the New `pd.col()` Expression Syntax
```{python}
# Same conversion using pd.col(), an expression-style helper available in newer pandas (3.0+)
# (on earlier pandas versions, use the lambda form from Step 11a instead)
df_pd_col = (
df_raw.assign(
population=pd.col("2008").round().astype("Int64")
)
)
print(f"✅ Population column type: {df_pd_col['population'].dtype}")
print(f"❓ Missing values after conversion: {df_pd_col['population'].isna().sum()}")
```
### Step 12a: Drop Missing Values and Remove Aggregated Rows
Before analysis, we need to:
1. **Remove rows with missing population values** — incomplete data skews results
2. **Filter out "Total" rows** — these are pre-calculated aggregates that cause double-counting
```{python}
# Check current shape before cleaning
print(f"📊 Before cleaning: {df_cleaned.shape[0]:,} rows")
print(f"❓ Rows with missing population: {df_cleaned['population'].isna().sum()}")
```
```{python}
# Preview what's in variable_name to understand the data
print("🔍 Unique values in variable_name:")
print(df_cleaned['variable_name'].unique())
```
**Method 1: Using `.dropna()` and `.query()` (Recommended)**
```{python}
# Drop NAs and filter out Total rows using method chaining
df_cleaned = (
df_cleaned
.dropna(subset=['population']) # Remove rows where population is NaN
.query("~variable_name.str.contains('Total', case=False)", engine='python')  # Remove Total rows (.str methods need the python engine)
.reset_index(drop=True)
)
print(f"✅ After cleaning: {df_cleaned.shape[0]:,} rows")
```
**Method 2: Using `.dropna()` and `.loc[]` (Alternative)**
```{python}
# Equivalent approach using .loc[] (produces the same result; shown for comparison)
dropna_df = (
df_cleaned
.dropna(subset=['population'])
.loc[lambda x: ~x['variable_name'].str.contains('Total', case=False)]
.reset_index(drop=True)
)
```
::: {.callout-tip}
## Why Remove "Total" Rows?
Census data often includes both **detail rows** (Male, Female) and **aggregated Total rows**.
If we keep Totals when calculating our own summaries:
```
Male: 100 + Female: 100 + Total: 200 = 400 ❌ (should be 200)
```
By removing Total rows first, we can calculate accurate statistics from scratch.
:::
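A three-row toy frame makes the double-counting concrete:

```python
import pandas as pd

toy = pd.DataFrame({
    "variable_name": ["Male", "Female", "Total"],
    "population": [100, 100, 200],
})

naive_sum = toy["population"].sum()  # the Total row is counted on top of its parts
correct_sum = toy.query("variable_name != 'Total'")["population"].sum()

print(naive_sum)    # inflated
print(correct_sum)  # accurate
```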
```{python}
# Verify only Male/Female remain
print("📝 Remaining values in variable_name:")
print(df_cleaned['variable_name'].unique())
```
::: {.callout-note}
## Variable Progression in This Tutorial
As we clean our data, we use descriptive variable names to track the state:
- `df_raw` → Original data loaded from URL
- `df` → Column names cleaned (snake_case)
- `df_cleaned` → String values cleaned + data types fixed
- `census_df` → Final dataset with extracted gender column (using method chaining)
Notice how we use **method chaining** to go directly from `df_cleaned` to `census_df` without intermediate copies—this is the power of pandas method chaining!
:::
### Step 12b: String Splitting - Extracting Gender Information
Now let's explore the `variable_name` column and extract meaningful information from it. This column contains structured text like "Population, Male (Number)" that we can split and parse.
```{python}
# Examine what's in the variable_name column
print("🔍 Unique values in variable_name column:")
print(df_cleaned['variable_name'].unique())
```
::: {.callout-tip}
## String Splitting Strategy
The `variable_name` column follows a pattern: "Category, Gender (Type)"
We want to extract the middle piece (Male or Female) using various pandas string methods. Each approach has different use cases!
:::
#### Method 1: Using `.str.split()` with Expand
The simplest approach: split on delimiter and select the piece you want.
```{python}
# Split on spaces and extract the second piece (index 1)
df_method_1a = (
df_cleaned
.assign(
# Split creates columns; we select the second one (index 1)
gender_extracted=lambda x: x['variable_name'].str.split(' ', expand=True).get(1)
)
)
print("✅ Method 1: Basic split with expand")
print(df_method_1a[['variable_name', 'gender_extracted']].drop_duplicates())
```
```{python}
# Split on spaces and extract the second piece (index 1)
df_method_1b = (
df_cleaned
.assign(
# Split creates columns; we select the second one (index 1)
gender_extracted=lambda x: x['variable_name'].str.split(' ', expand=True)[1]
)
)
print("✅ Method 1: Basic split with expand")
print(df_method_1b[['variable_name', 'gender_extracted']].drop_duplicates())
```
::: {.callout-note}
## How `.str.split()` Works
- `split(' ')` breaks the string at each occurrence of ' '
- `expand=True` creates separate columns (like a DataFrame)
- `.get(1)` or `[1]` selects the second column (Python uses 0-based indexing)
:::
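As a runnable illustration (the sample strings below mimic the "Category, Gender (Type)" pattern described above):

```python
import pandas as pd

s = pd.Series(["Population, Male (Number)", "Population, Female (Number)"])

# expand=True returns a DataFrame: one column per space-separated piece
parts = s.str.split(" ", expand=True)
print(parts)
print(parts[1].tolist())  # the second piece holds the gender
```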
#### Method 2: Chain Multiple Splits
Sometimes you need to clean up the extracted text further.
```{python}
# Split twice: first on spaces, then remove any parenthetical part
df_method_2 = (
df_cleaned
.assign(
# First split to get middle section
temp_extract=lambda x: x['variable_name'].str.split(' ', expand=True)[1],
# Then split on ' (' to remove the (Number) part
gender_clean=lambda x: x['temp_extract'].str.split(' \\(', expand=True)[0]
)
.drop(columns=['temp_extract']) # Remove temporary column
)
print("✅ Method 2: Chained splits for clean extraction")
print(df_method_2[['variable_name', 'gender_clean']].drop_duplicates())
```
#### Method 3: Using `.str.extract()` with Regex
For more complex patterns, regular expressions provide powerful extraction.
```{python}
# Use a regex pattern to capture the word that appears before ' ('
df_method_3 = (
df_cleaned
.assign(
gender_regex=lambda x: x['variable_name'].str.extract(r' (\w+) \(', expand=False)
)
)
print("✅ Method 3: Regex extraction")
print(df_method_3[['variable_name', 'gender_regex']].drop_duplicates())
```
::: {.callout-note}
## Understanding the Regex Pattern
`r' (\w+) \('` breaks down as:
- `r` = raw string (backslashes are passed through to the regex engine unchanged)
- ` ` = matches a literal space
- `(\w+)` = captures one or more word characters (letters/digits)
- ` \(` = matches literal " (" (backslash escapes the parenthesis)
The parentheses `()` create a "capture group" that `.extract()` returns.
:::
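The same pattern, demonstrated on sample strings shaped like the ones in this dataset:

```python
import pandas as pd

s = pd.Series(["Population, Male (Number)", "Population, Female (Number)"])

# (\w+) is the capture group: the word between a space and " ("
extracted = s.str.extract(r" (\w+) \(", expand=False)
print(extracted.tolist())
```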
#### Method 4: Using `.str.replace()` and String Slicing
Remove unwanted parts and extract what remains.
```{python}
# Remove the prefix and suffix to isolate the gender
df_method_4 = (
df_cleaned
.assign(
gender_replace=lambda x: (
x['variable_name']
.str.replace('Population, ', '', regex=False) # Remove prefix
.str.replace(r' \(Number\)', '', regex=True) # Remove suffix
)
)
)
print("✅ Method 4: String replacement")
print(df_method_4[['variable_name', 'gender_replace']].drop_duplicates())
```
#### Method 5: One-liner with Method Chaining
Combine everything into a compact, readable pipeline. This creates our final cleaned dataframe: `census_df`.
```{python}
# The most elegant solution: all-in-one chained extraction
# This is our final cleaned dataset with extracted gender information
census_df = (
df_cleaned
.assign(
gender=lambda x: (
x['variable_name']
.str.split(' ', expand=True)[1] # Extract middle piece
.str.split(r' \(', expand=True)[0] # Remove parenthetical (raw string avoids an invalid-escape warning)
.str.strip() # Remove any extra whitespace
)
)
)
print("✅ Method 5: Complete chained solution")
print(census_df[['variable_name', 'gender']].drop_duplicates())
# Verify the extraction worked
print("\n🎯 Unique gender values extracted:")
print(census_df['gender'].unique())
```
::: {.callout-tip}
## Which Method Should You Use?
**Choose based on your needs:**
| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Method 1** | Simple, consistent patterns | Fast, readable | Requires fixed structure |
| **Method 2** | Multi-step cleaning | Very clear logic | More verbose |
| **Method 3** | Complex patterns | Most flexible | Requires regex knowledge |
| **Method 4** | Known prefix/suffix | Intuitive | Less robust to variations |
| **Method 5** | Production code | Compact, professional | Harder for beginners |
For this tutorial, **Method 5** is ideal: it's clean, readable, and professional.
:::
### Step 13: Verify the Extraction
Let's confirm our extracted gender values match what we expect.
```{python}
# Check the distribution of gender values
gender_counts = (
census_df
.groupby('gender', as_index=False)
.agg(
count=('gender', 'size'),
total_population=('population', 'sum')
)
.sort_values('total_population', ascending=False)
)
print("📊 Gender distribution in extracted data:")
print(gender_counts)
```
### Step 14: Prepare Data for Analysis
We'll keep working with `census_df`, the cleaned dataframe with the extracted gender column, for all subsequent analyses:
```{python}
# Use the cleaned dataframe with extracted gender column for all subsequent analyses
print("✅ Ready for analysis!")
print(f"📋 Final dataset: {census_df.shape[0]:,} rows × {census_df.shape[1]} columns")
```
### Step 15: Final Cleaned Dataset Summary
```{python}
# Summary of our cleaned data with extracted gender column
print("=" * 50)
print("🎉 DATA CLEANING COMPLETE!")
print("=" * 50)
print(f"📋 Final shape: {census_df.shape[0]:,} rows × {census_df.shape[1]} columns")
print(f"📝 Columns: {census_df.columns.tolist()}")
print("\n🔍 Preview:")
census_df.head()
```
---
## Part 5: Filtering Data — Two Approaches
Before diving into analysis, let's learn how to **filter** (subset) data. Pandas offers two main approaches: `.loc[]` and `.query()`. Both are powerful—choose based on readability and context.
### Step 16: Filtering with `.loc[]`
The `.loc[]` accessor filters rows using **boolean conditions**. It's explicit and works well for simple filters.
```{python}
# Filter rows where gender is 'Male'
# .loc[condition] returns rows where condition is True
males_only = census_df.loc[census_df['gender'] == 'Male']
print(f"📊 Total rows: {len(census_df):,}")
print(f"👨 Male rows: {len(males_only):,}")
males_only.head()
```
Next, let's rename columns to friendlier names and keep only the ones we need:
```{python}
# Rename columns and keep the columns of interest
census_df = (
census_df
.rename(columns={
'region_name': 'state',
'age_name': 'age_category'
}
)
.query("age_category != 'Total'")
# select columns of interest
[['state', 'gender', 'age_category', 'population']]
)
# Inspect output
print(census_df)
```
```{python}
# Multiple conditions with .loc[]
# Use & (and), | (or), ~ (not)
# IMPORTANT: Wrap each condition in parentheses!
# Filter: Males in 'Jonglei' state
males_in_jonglei = census_df.loc[
((census_df['gender'] == 'Male') & (census_df['state'] == 'Jonglei'))
]
print(f"👨 Males in Jonglei: {len(males_in_jonglei):,} rows")
males_in_jonglei.head()
```
::: {.callout-note}
## `.loc[]` Syntax Rules
| Operation | Syntax |
|-----------|--------|
| Equals | `df['col'] == 'value'` |
| Not equals | `df['col'] != 'value'` |
| Greater than | `df['col'] > 100` |
| Multiple AND | `(cond1) & (cond2)` |
| Multiple OR | `(cond1) \| (cond2)` |
| NOT | `~(condition)` |
:::
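Here are the AND and NOT rules from the table, demonstrated on a small invented frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "state": ["Jonglei", "Unity", "Jonglei", "Warrap"],
    "gender": ["Male", "Male", "Female", "Female"],
    "population": [120, 80, 110, 90],
})

# AND: each condition in parentheses, joined with &
jonglei_males = df_demo.loc[(df_demo["state"] == "Jonglei") & (df_demo["gender"] == "Male")]

# NOT: ~ flips the boolean mask
not_jonglei = df_demo.loc[~(df_demo["state"] == "Jonglei")]

print(len(jonglei_males), len(not_jonglei))
```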
### Step 17: Filtering with `.query()`
The `.query()` method uses a **string expression**—often more readable, especially for complex filters.
```{python}
# Same filter using .query() - cleaner syntax!
males_only_query = census_df.query("gender == 'Male'")
print(f"👨 Male rows (via query): {len(males_only_query):,}")
males_only_query.head()
```
```{python}
# Multiple conditions with .query() - much cleaner!
# No parentheses needed, use 'and', 'or', 'not'
males_in_jonglei_query = census_df.query("gender == 'Male' and state == 'Jonglei'")
print(f"👨 Males in Jonglei (via query): {len(males_in_jonglei_query):,} rows")
males_in_jonglei_query.head()
```
### Step 18: Using Variables in `.query()` with `@`
A powerful feature: reference Python variables inside query strings using `@`.
```{python}
# Define filter criteria as variables
target_state = 'Upper Nile'
min_population = 1000
# Use @ to reference variables inside the query string
filtered_df = census_df.query("state == @target_state and population >= @min_population")
print(f"🎯 Rows matching criteria: {len(filtered_df):,}")
filtered_df.head()
```
```{python}
# Filter using a list of values with 'in'
states_of_interest = ['Jonglei', 'Upper Nile', 'Unity']
# .loc approach - use .isin()
result_loc = census_df.loc[census_df['state'].isin(states_of_interest)]
# .query approach - use 'in @list'
result_query = census_df.query("state in @states_of_interest")
print(f"📊 Results match: {len(result_loc) == len(result_query)}")
print(f"📋 Rows found: {len(result_query):,}")
```
::: {.callout-tip}
## When to Use Each Method
| Use `.loc[]` When... | Use `.query()` When... |
|---------------------|----------------------|
| Simple, single conditions | Complex, multi-condition filters |
| You need to select specific columns too | Readability is priority |
| Working with column names that have spaces | Filter logic reads like English |
| Programmatic/dynamic column selection | Using variables with `@` |
**In method chains**, `.query()` is often cleaner:
```python
# Clean and readable
result = (
df
.query("state == 'Jonglei' and gender == 'Male'")
.groupby('age_category')
.agg(population=('population', 'sum'))
)
```
:::
### Step 19: Combining `.loc[]` with Column Selection
A unique advantage of `.loc[]`: select rows AND columns simultaneously.
```{python}
# Select specific rows AND columns with .loc[rows, columns]
subset = census_df.loc[
census_df['state'] == 'Jonglei', # Row condition
['state', 'gender', 'population'] # Columns to keep
]
print("📊 Filtered rows with selected columns:")
subset.head()
```
```{python}
# Equivalent with .query() + column selection
subset_query = (
census_df
.query("state == 'Jonglei'")
[['state', 'gender', 'population']] # Column selection after query
)
subset_query.head()
```
---
## Part 6: Summary Statistics
Now that our data is clean, let's calculate some high-level metrics.
### Step 20: Calculate Overview Metrics
```{python}
# Build a summary table of key statistics
summary_stats = pd.DataFrame({
'metric': [
'Total Population',
'Number of States',
'Age Categories',
'Gender Groups'
],
'value': [
census_df['population'].sum(),
census_df['state'].nunique(),
census_df['age_category'].nunique(),
census_df['gender'].nunique()
]
})
summary_stats
```
### Step 21: Display as Formatted Table
Let's make this look professional with Great Tables.
```{python}
(
GT(summary_stats)
.tab_header(
title="South Sudan 2008 Census Overview",
subtitle="Key Summary Statistics"
)
.fmt_number(
columns="value",
rows=[0], # Only format the first row (population)
decimals=0,
use_seps=True
)
.cols_label(
metric="Metric",
value="Value"
)
.cols_align(align="center")
)
```
### Step 22a: Age Category Cleaning
#### Method 1: Using .replace() with a mapping dictionary
```{python}
# Define the age category mapping (matching actual data format: '0 To 4')
age_mapping = {
'0 To 4': '0-14',
'5 To 9': '0-14',
'10 To 14': '0-14',
'15 To 19': '15-24',
'20 To 24': '15-24',
'25 To 29': '25-34',
'30 To 34': '25-34',
'35 To 39': '35-44',
'40 To 44': '35-44',
'45 To 49': '45-54',
'50 To 54': '45-54',
'55 To 59': '55-64',
'60 To 64': '55-64',
'65+': '65+' # Already in correct format
}
# Apply the mapping using .replace()
census_new_age_categories = (
census_df
.assign(
age_category=lambda x: x['age_category'].replace(age_mapping)
)
)
print("✅ Age categories standardized!")
print("\n🎯 New age categories:")
print(sorted(census_new_age_categories['age_category'].unique()))
```
#### Method 2: Using case_when() (Pandas 2.2+)
```{python}
## Using case_when for conditional transformations
# case_when is called on a Series instance, not as a class method
census_clean = (
census_df
.assign(
age_category=lambda x: x['age_category'].case_when(
caselist=[
(x['age_category'].isin(['0 To 4', '5 To 9', '10 To 14']), '0-14'),
(x['age_category'].isin(['15 To 19', '20 To 24']), '15-24'),
(x['age_category'].isin(['25 To 29', '30 To 34']), '25-34'),
(x['age_category'].isin(['35 To 39', '40 To 44']), '35-44'),
(x['age_category'].isin(['45 To 49', '50 To 54']), '45-54'),
(x['age_category'].isin(['55 To 59', '60 To 64']), '55-64'),
(x['age_category'] == '65+', '65+'),
]
)
)
)
print("✅ Age categories standardized using case_when!")
print("\n🎯 New age categories:")
print(sorted(census_clean['age_category'].unique()))
```
---
## Part 7: Gender Analysis
Let's explore how the population breaks down by gender.
### Step 22b: Calculate National Gender Distribution
This demonstrates a complete method chaining pipeline:
```{python}
# Aggregate population by gender
# This pipeline: groups → sums → sorts → calculates percentage
gender_distribution = (
census_clean
.groupby('gender', as_index=False) # Group rows by gender
.agg(population=('population', 'sum')) # Sum population in each group
.sort_values('population', ascending=False) # Sort largest first
.assign(
percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2)
)
.reset_index(drop=True) # Clean up the index
)
gender_distribution
```
::: {.callout-note}
## Breaking Down the Pipeline
| Step | What It Does |
|------|--------------|
| `.groupby('gender')` | Creates groups: one for Male, one for Female |
| `.agg(population=('population', 'sum'))` | Sums the population column within each group |
| `.sort_values(...)` | Reorders rows by population (descending) |
| `.assign(percentage=...)` | Adds a new column with calculated percentages |
| `.reset_index(drop=True)` | Creates a clean 0, 1, 2... index |
:::
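The same five-step pipeline, run end-to-end on a four-row toy frame so you can trace each step's output:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "population": [60, 40, 30, 70],
})

dist = (
    toy
    .groupby("gender", as_index=False)                       # one group per gender
    .agg(population=("population", "sum"))                   # Male: 90, Female: 110
    .sort_values("population", ascending=False)              # largest group first
    .assign(percentage=lambda x: (x["population"] / x["population"].sum() * 100).round(2))
    .reset_index(drop=True)
)
print(dist)
```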
### Step 23: Display Gender Table
```{python}
(
GT(gender_distribution)
.tab_header(
title="National Gender Distribution",
subtitle="Population and Percentages"
)
.fmt_number(columns="population", decimals=0, use_seps=True)
.fmt_number(columns="percentage", decimals=2)
.cols_label(
gender="Gender",
population="Population",
percentage="Percentage (%)"
)
)
```
### Step 24: Visualize with Interactive Donut Chart
Interactive charts let viewers explore data themselves—hover, click, zoom!
```{python}
#| fig-width: 10
#| fig-height: 6
# Create a donut chart (pie chart with a hole)
fig_gender = px.pie(
gender_distribution,
values='population',
names='gender',
title='Gender Distribution Across South Sudan',
color='gender',
color_discrete_map={'Male': '#22d3ee', 'Female': '#FFD700'},
hole=0.4 # Creates the donut hole
)
# Customize appearance
fig_gender.update_traces(
textposition='inside',
textinfo='percent+label',
marker=dict(line=dict(color='#ffffff', width=2))
)
fig_gender.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
height=500,
legend=dict(
orientation="h",
yanchor="bottom",
y=-0.15,
xanchor="center",
x=0.5
)
)
fig_gender.show()
```
---
## Part 8: State-Level Gender Analysis
Now let's dig deeper—how does gender distribution vary by state?
### Step 25: Aggregate by State and Gender
```{python}
# Group by both state AND gender
state_gender_long = (
census_clean
.groupby(['state', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
)
print("📊 Long format (each state-gender combo is one row):")
state_gender_long.head(10)
```
### Step 26: Pivot to Wide Format
Sometimes we need data in "wide" format—one row per state, separate columns for each gender.
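Before pivoting the census data, here is a toy round trip between long and wide format, with `melt()` as the inverse of `pivot()` (the numbers are invented):

```python
import pandas as pd

long_df = pd.DataFrame({
    "state": ["Jonglei", "Jonglei", "Unity", "Unity"],
    "gender": ["Male", "Female", "Male", "Female"],
    "population": [120, 110, 80, 90],
})

# Long -> wide: one row per state, one column per gender
wide = long_df.pivot(index="state", columns="gender", values="population").reset_index()

# Wide -> long again: melt stacks the gender columns back into rows
back = wide.melt(id_vars="state", var_name="gender", value_name="population")

print(wide)
```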
```{python}
# Pivot: rows become states, columns become genders
state_gender_wide = (
state_gender_long
.pivot(index='state', columns='gender', values='population')
.reset_index()
# .clean_names() # Standardize the new column names
.rename(columns=lambda c: c.lower())
)
print("📊 Wide format (one row per state):")
state_gender_wide.head()
```
### Step 27: Calculate Derived Metrics
Now we can calculate useful metrics like gender ratio.
```{python}
# Add calculated columns
state_gender = (
state_gender_wide
.assign(
total=lambda x: x['male'] + x['female'],
male_pct=lambda x: (x['male'] / x['total'] * 100).round(2),
female_pct=lambda x: (x['female'] / x['total'] * 100).round(2),
gender_ratio=lambda x: (x['male'] / x['female'] * 100).round(2) # Males per 100 females
)
.sort_values('total', ascending=False)
.reset_index(drop=True)
)
print("📊 State-level gender analysis:")
state_gender.head(10)
```
::: {.callout-note}
## Understanding Gender Ratio
A gender ratio of 100 means equal males and females.
- Above 100 = more males
- Below 100 = more females
:::
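The formula in plain arithmetic, with invented counts:

```python
# Males per 100 females: the same formula used for the gender_ratio column
male, female = 110, 100
gender_ratio = round(male / female * 100, 2)
print(gender_ratio)  # above 100, so more males than females
```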
### Step 28: Display Top 5 States Table
```{python}
top_5_states = state_gender.head(5)
(
GT(top_5_states)
.tab_header(
title="Top 5 States by Population",
subtitle="Gender Breakdown and Ratios"
)
.fmt_number(columns=['male', 'female', 'total'], decimals=0, use_seps=True)
.fmt_number(columns=['male_pct', 'female_pct', 'gender_ratio'], decimals=2)
.cols_label(
state="State",
male="Male",
female="Female",
total="Total",
male_pct="Male %",
female_pct="Female %",
gender_ratio="Ratio (M/100F)"
)
.data_color(
columns="gender_ratio",
palette=["#FFD700", "#ffffff", "#22d3ee"]
)
)
```
### Step 29: Visualize State Gender Distribution
```{python}
#| fig-width: 12
#| fig-height: 8
# Reshape data for plotting (back to long format)
state_gender_plot = (
state_gender
.head(5)
.melt(
id_vars=['state'],
value_vars=['male', 'female'],
var_name='gender',
value_name='population'
)
.assign(gender=lambda x: x['gender'].str.title()) # Capitalize for display
)
# Create grouped bar chart
fig_state_gender = px.bar(
state_gender_plot,
x='state',
y='population',
color='gender',
title='Population by State and Gender (Top 5 States)',
labels={'population': 'Population', 'state': 'State'},
color_discrete_map={'Male': '#22d3ee', 'Female': '#FFD700'},
barmode='group'
)
fig_state_gender.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
xaxis_tickangle=-45,
height=600,
hovermode='x unified',
legend=dict(
title="Gender",
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1
)
)
fig_state_gender.update_traces(
hovertemplate='<b>%{x}</b><br>Population: %{y:,}<extra></extra>'
)
fig_state_gender.show()
```
---
## Part 9: Age Category Analysis
### Step 30: Calculate National Age Distribution
```{python}
# Aggregate by age category
age_distribution = (
census_clean
.groupby('age_category', as_index=False)
.agg(population=('population', 'sum'))
.sort_values('population', ascending=False)
.assign(
percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2)
)
.reset_index(drop=True)
)
age_distribution
```
### Step 31: Display Age Distribution Table
```{python}
(
GT(age_distribution)
.tab_header(
title="Population Distribution by Age Category",
subtitle="National Summary"
)
.fmt_number(columns="population", decimals=0, use_seps=True)
.fmt_number(columns="percentage", decimals=2)
.cols_label(
age_category="Age Category",
population="Population",
percentage="Percentage (%)"
)
.data_color(
columns="percentage",
palette=["#000000", "#0891b2", "#22d3ee", "#FFD700"]
)
)
```
### Step 32: Visualize Age Distribution
```{python}
#| fig-width: 12
#| fig-height: 6
# Create horizontal bar chart (easier to read category names)
fig_age = px.bar(
age_distribution.sort_values('population'), # Sort for visual clarity
x='population',
y='age_category',
orientation='h',
title='Population Distribution by Age Category',
labels={'population': 'Population', 'age_category': 'Age Category'},
color='population',
color_continuous_scale=['#000000', '#22d3ee', '#FFD700']
)
fig_age.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
height=500,
showlegend=False
)
fig_age.update_traces(
hovertemplate='<b>%{y}</b><br>Population: %{x:,}<extra></extra>'
)
fig_age.show()
```
---
## Part 10: Age and Gender Interaction
The population pyramid is a classic demographic visualization.
### Step 33: Prepare Age-Gender Cross-Tabulation
```{python}
# Aggregate by both age and gender
age_gender_long = (
census_clean
.groupby(['age_category', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
)
# Pivot and calculate metrics
age_gender = (
age_gender_long
.pivot(index='age_category', columns='gender', values='population')
.reset_index()
# .clean_names()
.rename(columns=lambda c: c.lower())
.assign(
total=lambda x: x['male'] + x['female'],
female_pct=lambda x: (x['female'] / x['total'] * 100).round(2)
)
.sort_values('total', ascending=False)
.reset_index(drop=True)
)
age_gender
```
### Step 34: Create Population Pyramid
A population pyramid shows males on the left, females on the right—a powerful way to visualize demographic structure.
```{python}
#| fig-width: 12
#| fig-height: 7
# Prepare data: males get negative values (to extend left)
pyramid_data = (
    age_gender_long
    .assign(
        # Vectorized: keep female counts, negate male counts (faster than row-wise .apply)
        population_adjusted=lambda x: x['population'].where(
            x['gender'] != 'Male', -x['population']
        )
    )
)
# Separate by gender
male_pyramid = pyramid_data.query("gender == 'Male'")
female_pyramid = pyramid_data.query("gender == 'Female'")
# Build the pyramid
fig_pyramid = go.Figure()
# Male bars (left side)
fig_pyramid.add_trace(
go.Bar(
y=male_pyramid['age_category'],
x=male_pyramid['population_adjusted'],
name='Male',
orientation='h',
marker=dict(color='#22d3ee'),
hovertemplate='<b>Male - %{y}</b><br>Population: %{customdata:,}<extra></extra>',
customdata=male_pyramid['population']
)
)
# Female bars (right side)
fig_pyramid.add_trace(
go.Bar(
y=female_pyramid['age_category'],
x=female_pyramid['population'],
name='Female',
orientation='h',
marker=dict(color='#FFD700'),
hovertemplate='<b>Female - %{y}</b><br>Population: %{x:,}<extra></extra>'
)
)
fig_pyramid.update_layout(
title='Population Pyramid: Age and Gender Distribution',
title_font_size=18,
title_font_color="#06b6d4",
xaxis=dict(
title='Population',
tickvals=[-400000, -300000, -200000, -100000, 0, 100000, 200000, 300000, 400000],
ticktext=['400K', '300K', '200K', '100K', '0', '100K', '200K', '300K', '400K']
),
yaxis=dict(title='Age Category'),
barmode='overlay',
height=600,
bargap=0.1,
legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1
)
)
fig_pyramid.show()
```
---
## Part 11: State Rankings
### Step 35: Create State Population Rankings
```{python}
# Calculate comprehensive state rankings
state_ranking = (
census_clean
.groupby('state', as_index=False)
.agg(population=('population', 'sum'))
.sort_values('population', ascending=False)
.assign(
rank=lambda x: range(1, len(x) + 1),
percentage=lambda x: (x['population'] / x['population'].sum() * 100).round(2),
cumulative_pct=lambda x: x['percentage'].cumsum().round(2)
)
.reset_index(drop=True)
)
state_ranking.head(5)
```
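The `cumulative_pct` column comes from `.cumsum()`, which produces a running total down the sorted shares. A tiny illustration with made-up percentages:

```python
import pandas as pd

# Hypothetical shares, already sorted in descending order and summing to 100
shares = pd.Series([40.0, 25.0, 20.0, 10.0, 5.0])

# Running total: after the top two groups we have covered 65% of the whole
cumulative = shares.cumsum()
print(cumulative.tolist())  # [40.0, 65.0, 85.0, 95.0, 100.0]
```

Reading the cumulative column tells you how many top states you need to cover, say, 80% of the population.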
### Step 36: Display Top 5 States Table
```{python}
top_5 = state_ranking.head(5)
(
GT(top_5)
.tab_header(
title="Top 5 Most Populous States",
subtitle="Population, Share, and Cumulative Distribution"
)
.fmt_number(columns="population", decimals=0, use_seps=True)
.fmt_number(columns=["percentage", "cumulative_pct"], decimals=2)
.cols_label(
rank="Rank",
state="State",
population="Population",
percentage="% of Total",
cumulative_pct="Cumulative %"
)
.data_color(
columns="population",
palette=["#000000", "#0891b2", "#22d3ee", "#FFD700"]
)
)
```
### Step 37: Visualize with Treemap
Treemaps show hierarchical data as nested rectangles—size represents magnitude.
```{python}
#| fig-width: 10
#| fig-height: 6
fig_treemap = px.treemap(
top_5,
path=['state'],
values='population',
    title='State Population Distribution (Top 5)',
color='population',
color_continuous_scale=['#000000', '#0891b2', '#22d3ee', '#FFD700']
)
fig_treemap.update_traces(
textposition="middle center",
texttemplate="<b>%{label}</b><br>%{value:,}",
hovertemplate='<b>%{label}</b><br>Population: %{value:,}<br>Share: %{percentRoot:.2%}<extra></extra>'
)
fig_treemap.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
height=600
)
fig_treemap.show()
```
---
## Part 12: State-Level Age Distribution
### Step 38: Analyze Age in Top 5 States
```{python}
# Get top 5 states by population
top_5_states = state_ranking.head(5)['state'].tolist()
print(f"📍 Top 5 states: {top_5_states}")
```
```{python}
# Calculate age distribution within each top state
state_age_distribution = (
census_clean
.query("state in @top_5_states") # Filter to top 5 only
.groupby(['state', 'age_category'], as_index=False)
.agg(population=('population', 'sum'))
.assign(
percentage=lambda x: (
x.groupby('state')['population']
.transform(lambda pop: pop / pop.sum() * 100)
.round(2)
)
)
)
state_age_distribution.head(10)
```
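The `.transform()` call above is doing the heavy lifting. Unlike `.agg()`, which collapses each group to a single row, `.transform()` returns a result aligned with the original rows, so every row receives its own within-group percentage. A small sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "population": [30, 70, 50, 150],
})

# transform() broadcasts each group's sum back to every row of that group,
# so each row's population is divided by its OWN state's total
df["pct"] = (
    df.groupby("state")["population"]
    .transform(lambda pop: pop / pop.sum() * 100)
    .round(2)
)
print(df)  # pct: 30.0, 70.0 for state A; 25.0, 75.0 for state B
```

If we had used `.agg()` here instead, we would have gotten two rows (one per state) and lost the row-level alignment needed for the percentage column.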
::: {.callout-tip}
## The `query()` Method
`query("state in @top_5_states")` is a readable way to filter data. The `@` symbol references a Python variable inside the query string.
This is equivalent to:
```python
df[df['state'].isin(top_5_states)]
```
:::
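To see that equivalence in action, here's a self-contained sketch with a toy DataFrame (the state names are just examples):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Jonglei", "Unity", "Lakes"],
    "population": [100, 80, 60],
})
keep = ["Jonglei", "Lakes"]

via_query = df.query("state in @keep")   # @ pulls in the Python list
via_isin = df[df["state"].isin(keep)]    # explicit boolean mask

# Both approaches return exactly the same rows
assert via_query.equals(via_isin)
```

`.query()` tends to read better in long method chains, while the `.isin()` mask is more explicit; both are idiomatic pandas.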
### Step 39: Visualize State Age Distribution
```{python}
#| fig-width: 12
#| fig-height: 8
fig_state_age = px.bar(
state_age_distribution,
x='state',
y='population',
color='age_category',
title='Age Distribution in Top 5 States',
labels={
'population': 'Population',
'state': 'State',
'age_category': 'Age Category'
},
text='percentage',
color_discrete_sequence=['#000000', '#0891b2', '#22d3ee', '#06b6d4', '#FFD700', '#f59e0b']
)
fig_state_age.update_traces(
texttemplate='%{text:.1f}%',
textposition='inside',
hovertemplate='<b>%{x}</b><br>%{fullData.name}<br>Population: %{y:,}<br>Percentage: %{text:.2f}%<extra></extra>'
)
fig_state_age.update_layout(
title_font_size=18,
title_font_color="#06b6d4",
xaxis_tickangle=-45,
height=600,
barmode='stack',
legend=dict(
title="Age Category",
orientation="v",
yanchor="middle",
y=0.5,
xanchor="left",
x=1.02
)
)
fig_state_age.show()
```
---
## Part 13: Comprehensive Dashboard
Let's combine everything into a single, multi-panel dashboard.
### Step 40: Build Multi-Panel Dashboard
```{python}
#| fig-width: 14
#| fig-height: 10
# Filter to top 5 states
top_5_df = census_clean.query("state in @top_5_states")
# Create 2x2 subplot grid
fig_dashboard = make_subplots(
rows=2, cols=2,
subplot_titles=(
'Total Population by State',
'Gender Distribution',
'Age Distribution',
'Gender Ratio (Males per 100 Females)'
),
specs=[
[{"type": "bar"}, {"type": "bar"}],
[{"type": "bar"}, {"type": "bar"}]
],
vertical_spacing=0.12,
horizontal_spacing=0.1
)
# ----- Panel 1: Total Population -----
panel1_data = (
top_5_df
.groupby('state', as_index=False)
.agg(population=('population', 'sum'))
.sort_values('population', ascending=False)
)
fig_dashboard.add_trace(
go.Bar(
x=panel1_data['state'],
y=panel1_data['population'],
marker_color='#22d3ee',
name='Total Pop',
showlegend=False
),
row=1, col=1
)
# ----- Panel 2: Gender Distribution -----
panel2_data = (
top_5_df
.groupby(['state', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
)
for gender, color in [('Male', '#22d3ee'), ('Female', '#FFD700')]:
gender_data = panel2_data.query("gender == @gender")
fig_dashboard.add_trace(
go.Bar(
x=gender_data['state'],
y=gender_data['population'],
name=gender,
marker_color=color
),
row=1, col=2
)
# ----- Panel 3: Age Distribution (Stacked) -----
panel3_data = (
top_5_df
.groupby(['state', 'age_category'], as_index=False)
.agg(population=('population', 'sum'))
)
age_categories = panel3_data['age_category'].unique()
colors = ['#000000', '#0891b2', '#22d3ee', '#06b6d4', '#FFD700', '#f59e0b']
for i, age_cat in enumerate(age_categories):  # include every age category
age_data = panel3_data.query("age_category == @age_cat")
fig_dashboard.add_trace(
go.Bar(
x=age_data['state'],
y=age_data['population'],
name=age_cat,
marker_color=colors[i % len(colors)],
showlegend=False
),
row=2, col=1
)
# ----- Panel 4: Gender Ratio -----
panel4_data = (
top_5_df
.groupby(['state', 'gender'], as_index=False)
.agg(population=('population', 'sum'))
.pivot(index='state', columns='gender', values='population')
.reset_index()
    .rename(columns=lambda c: c.lower())  # lowercase 'Male'/'Female' labels; clean_names() would also work
.assign(
ratio=lambda x: (x['male'] / x['female'] * 100).round(2)
)
)
fig_dashboard.add_trace(
go.Bar(
x=panel4_data['state'],
y=panel4_data['ratio'],
marker_color='#FFD700',
name='Gender Ratio',
showlegend=False
),
row=2, col=2
)
# ----- Layout -----
fig_dashboard.update_layout(
title_text="Comprehensive Demographic Dashboard: Top 5 States",
title_font_size=20,
title_font_color="#06b6d4",
height=900,
showlegend=True,
legend=dict(
orientation="h",
yanchor="bottom",
y=-0.15,
xanchor="center",
x=0.5
)
)
fig_dashboard.update_xaxes(tickangle=-45)
fig_dashboard.update_yaxes(title_text="Population", row=1, col=1)
fig_dashboard.update_yaxes(title_text="Population", row=1, col=2)
fig_dashboard.update_yaxes(title_text="Population", row=2, col=1)
fig_dashboard.update_yaxes(title_text="Ratio", row=2, col=2)
fig_dashboard.show()
```
---
## Key Insights
::: {.callout-important}
## What the Data Tells Us
### 1. Population Concentration
The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.
### 2. Youth Demographics
The age distribution reveals a young population—typical of developing nations. This "youth bulge" represents both opportunity (workforce potential) and challenge (education, employment needs).
### 3. Gender Balance
Most states show relatively balanced gender distributions, with some variation that may reflect migration patterns, conflict impacts, or data collection methodology.
### 4. Regional Disparities
Substantial population differences between states suggest the need for differentiated development strategies rather than uniform national approaches.
:::
---
## Conclusion
Congratulations! You've completed a full demographic analysis using professional data science techniques.
**What You Learned:**
✅ Loading data from URLs with `pd.read_csv()`
✅ Exploring data structure with `.head()`, `.dtypes`, `.isna()`
✅ Cleaning data with method chaining and pyjanitor
✅ **Filtering with `.loc[]`** — explicit boolean conditions
✅ **Filtering with `.query()`** — readable string expressions
✅ Aggregating with `.groupby()` and `.agg()`
✅ Reshaping with `.pivot()` and `.melt()`
✅ Creating interactive visualizations with Plotly
✅ Building multi-panel dashboards
**The techniques you've learned apply to ANY dataset**—from business analytics to scientific research.
::: {.callout-tip}
## Next Steps
1. Download the code and try it with your own data
2. Experiment with different Plotly chart types
3. Explore pandas' full method chaining capabilities
4. Practice! The best way to learn is by doing.
:::
---
## Technical Reference
### Libraries Used
| Library | Version | Purpose |
|---------|---------|---------|
| **pandas** | 2.0+ | Data manipulation with method chaining |
| **pyjanitor** | 0.26+ | Clean column names and data cleaning helpers |
| **Plotly** | 5.17+ | Interactive, web-ready visualizations |
| **Great Tables** | 0.8+ | Professional table formatting |
### Key Techniques Demonstrated
| Technique | Description |
|-----------|-------------|
| **Method Chaining** | Connect operations with `.` for readable pipelines |
| **`clean_names()`** | Standardize column names to snake_case |
| **Lambda Functions** | Inline calculations within `assign()` |
| **`.loc[]` Filtering** | Explicit boolean conditions for row selection |
| **`.query()` Filtering** | Readable string expressions with `@` variable support |
| **`groupby().agg()`** | Flexible aggregation with named outputs |
| **`pivot()` / `melt()`** | Reshape between long and wide formats |
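Most of these techniques compose into a single pipeline. Here's a minimal, self-contained sketch (toy data, with a plain `rename` standing in for pyjanitor's `clean_names()`) that chains renaming, filtering, aggregation, and sorting:

```python
import pandas as pd

# Toy raw data with messy column names (hypothetical, not the census)
raw = pd.DataFrame({
    "State Name": ["A", "A", "B"],
    "Gender": ["Male", "Female", "Male"],
    "Population": [10, 12, 20],
})

summary = (
    raw
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # snake_case, like clean_names()
    .query("gender == 'Male'")                    # filter with a readable string expression
    .groupby("state_name", as_index=False)
    .agg(total=("population", "sum"))             # named aggregation
    .sort_values("total", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Each step receives the output of the one before it, so the whole transformation reads top to bottom like a recipe, with no throwaway intermediate variables.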
---
*Tutorial by Alierwai Reng, Founder of PyStatR+. For data science training or collaboration, reach out at info@pystatrplus.org*
---
::: {.callout-tip icon="false"}
## About the Author
Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator and analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in "Education from the Heart."
**For training, consulting, or collaboration opportunities:**
📧 info@pystatrplus.org
🌐 [pystatrplus.org](https://pystatrplus.org)
:::
::: {.callout-note icon="false"}
## Editor's Note
This tutorial reflects PyStatR+'s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.
At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.
**PyStatR+**: *Learning Simplified. Communication Amplified.*
:::